Split Large File Into Smaller Files Using Parallel Stream in Java

Split Large File Into Smaller Files Using Parallel Stream in Java - java

For a homework assignment, I need to implement external sorting such that I can sort a 10GB file with 1GB physical memory. Currently, I'm using a BufferedReader on the large file and constructing/sorting the smaller files sequentially. Then in the merge step, I have BufferedReaders open for all small files and a single BufferedWriter for the large final file where I write to the large file using the merge k sorted lists algorithm with a PriorityQueue. This works, but it needs to be faster (take half as much time to be exact).
The entire splitting step happens sequentially and the entire merging step also happens sequentially. I think I can at least split and sort the files in parallel using multiple threads with different virtual memory spaces. Then the memory used is mostly memory-mapped files and the OS will take care of optimally paging in and out data from physical memory. I was wondering if there was a way for Java to do this using parallel streams. Something along the lines of:
largeFile.splitInParallel(100000)
.lines()
.map((s) -> new LineObject(s))
.sorted()
.forEach(writeSmallFileToDisk)
where the argument to splitInParallel is the number of lines I want in the smaller files. Any help is appreciated, thanks!
EDIT:
My code is
public class Main {
private static final int BUFFER_SIZE = 10_000_000;
/**
* A main method to run examples.
*
* #param args not used
*/
public static void main(String[] args) throws IOException {
System.out.println("Starting...");
String file = args[0];
int batchSize = Integer.parseInt(args[1]);;
try {
FileInputStream fin = new FileInputStream(file);
BufferedInputStream bis = new BufferedInputStream(fin, BUFFER_SIZE);
BufferedReader br = new BufferedReader(new InputStreamReader(bis), BUFFER_SIZE);
int lineNumber = 0;
int batchId = 0;
String line;
TaxiEntry[] batch = new TaxiEntry[batchSize];
int i = 0;
while ((line = br.readLine()) != null) {
TaxiEntry taxiEntry = parseLine(line);
batch[i++] = taxiEntry;
lineNumber++;
if (lineNumber % batchSize == 0) {
String outputFileName = String.format("batches/batch_%d.txt", batchId);
BufferedWriter bf = new BufferedWriter(new FileWriter(outputFileName, true), BUFFER_SIZE);
Arrays.parallelSort(batch);
for (int j = 0; j < i; j++) {
bf.write(batch[j].toString());
if (j != i) {
bf.newLine();
}
}
batchId++;
i = 0;
bf.flush();
}
}
String outputFileName = String.format("batches/batch_%d.txt", batchId);
BufferedWriter bf = new BufferedWriter(new FileWriter(outputFileName, true), BUFFER_SIZE);
Arrays.parallelSort(batch, 0, i);
for (int j = 0; j < i; j++) {
bf.write(batch[j].toString());
if (j != i) {
bf.newLine();
}
}
batchId++;
bf.flush();
System.out.println("Processed " + lineNumber + " lines");
merge(batchId);
} catch (IOException e) {
e.printStackTrace();
}
}
public static void merge(int numBatches) throws IOException {
System.out.println("Starting merge...");
// Open readers
BufferedReader[] readers = new BufferedReader[numBatches];
for (int i = 0; i < numBatches; i++) {
String file = String.format("batches/batch_%d.txt", i);
FileInputStream fin = new FileInputStream(file);
BufferedInputStream bis = new BufferedInputStream(fin, BUFFER_SIZE);
BufferedReader br = new BufferedReader(new InputStreamReader(bis), BUFFER_SIZE);
readers[i] = br;
}
// Merge
String outputFileName = "result/final.txt";
BufferedWriter bf = new BufferedWriter(new FileWriter(outputFileName, true), BUFFER_SIZE);
PriorityQueue<IndexedTaxiNode> curEntries = new PriorityQueue<>();
for (int i = 0; i < numBatches; i++) {
BufferedReader reader = readers[i];
String next = reader.readLine();
if (next != null) {
TaxiEntry curr = parseLine(next);
curEntries.add(new IndexedTaxiNode(curr, i));
}
}
while (!curEntries.isEmpty()) {
// get max from curEntries
IndexedTaxiNode maxNode = curEntries.remove();
bf.write(maxNode.toString());
bf.newLine();
int index = maxNode.index;
String next = readers[index].readLine();
if (next != null) {
TaxiEntry newEntry = parseLine(next);
curEntries.add(new IndexedTaxiNode(newEntry, index));
}
}
bf.flush();
}
public static TaxiEntry parseLine(String line) {
return new TaxiEntry(line, Double.parseDouble(line.split(",")[16]));
}
}

Doing some timings. I found that the time to read from disk and the time to
do a sort are similar order of magnitude.
System.out.println("Begin loading file");
// do loading stuff
System.out.format("elapsed %.03f ms%n%n", (finishTime - startTime) / 1e6);
System.out.println("Sorting lines");
// do sorting stuff
System.out.format("elapsed %.03f ms%n", (finishTime - startTime) / 1e6);
Console output is:
Begin loading file
elapsed 918.933 ms
Sorting lines
elapsed 1360.896 ms
I used a modest file of about 150 MB for the timings. It might not be a good idea to have lots of threads all reading from disk at the same time.
My suggestion for what it's worth is to have one thread that does all of the disk reading, and another thread that concurrently does sorting. I could only see a way to do this for the splitting and sorting phase.
For the splitting phase, you cannot read all the segments in one go because that would consume too much memory. So you read a few segments, write a few, read a few, and so on. The idea of this interleaving, is to ensure the disk is continuously kept busy, by delegating the sorting operation to another thread. Hopefully by the time the disk is ready to write a segment the sort on that segment has completed so the disk never has to wait.
List<String> lines = new ArrayList<>();
int i = 0;
while (someCondition()) {
String line = reader.readLine();
lines.add(line);
if (lines.size() == BATCH_SIZE) {
sendMsgToWorker(lines); // send to worker thread
if (i == MAX_MESSAGE_QUEUE - 1) {
for (int j = 0; j < MAX_MESSAGE_QUEUE; j++) {
List<String> sortedLines = waitForLineFromWorker(); // wait for worker thread
writeTmpFile(sortedLines);
}
}
lines = new ArrayList<>();
i = (i + 1) % MAX_MESSAGE_QUEUE;
}
}
An outline for the splitting and sorting phase is shown above, without covering any edge cases. The amount of memory used would be proportional to BATCH_SIZE * MAX_MESSAGE_QUEUE.
Unfortunately, I don't see a way to apply concurrency to the phase of merging the multiple files. The disk is just the disk so cannot go any faster even with multiple threads.
You could try investigating parallel quicksort, but the problem with quicksort is choosing a pivot point so that the partitions end up a reasonable size.

Related

Java FileChannel Vs BufferedReader - Spring Batch - Reader

We process huge files (sometimes 50 GB each file). The application reads this one file and based on the business logic, it will write multiple output files (4-6).
The records in the file are of variable length and each field in a record is a delimiter separated.
Going by the understanding that reading a file using FileChannel with a ByteBuffer was always better than using a BufferedReader.readLine and then using a split by the delimiter.
BufferSizes tried 10240(10KB) and even more
Commit interval - 5000, 10000 etc
Below is how we used file channel to read:
Read byte by byte. Check if the read byte is a new line char(10) -
which means end of line.
check for delimiter bytes. capture the bytes read in a byte array(we initialized this byte array with a maximum field size of 350 bytes) until delimiter bytes are encountered.
convert these bytes read until this time, to String using UTF-8 encoding - new String(byteArr, 0, index,"UTF-8") to be specific - index is the number of bytes read until delimiter.
Using this method of reading the file using FileChannel took 57 minutes to process the file.
We want to decrease this time and tried using BufferredReader.readLine() and then use a split by delimiter, to see how it fares.
And shockingly the same file completed processing only in 7 minutes.
What's the catch here? Why FileChannel is taking more time than a buffered reader and then using a string split.
I was always under the assumption that ReadLine and Split combination will have a big performance impact?
Can any one throw light on if I was using FileChannel in a wrong way? One
Thanks in advance. Hope I have summarized the issue properly.
The below is sample code :
while (inputByteBuffer.hasRemaining() && (b = inputByteBuffer.get()) != 0){
boolean endOfField = false;
if (b == 10){
break;
}
else{
if (b == 94){//^
if (!inputByteBuffer.hasRemaining()){
inputByteBuffer.clear();
noOfBytes = inputFileChannel.read(inputByteBuffer);
inputByteBuffer.flip();
}
if (inputByteBuffer.hasRemaining()){
byte b2 = inputByteBuffer.get();
if (b2 == 124){//|
if (!inputByteBuffer.hasRemaining()){
inputByteBuffer.clear();
noOfBytes = inputFileChannel.read(inputByteBuffer);
inputByteBuffer.flip();
}
if (inputByteBuffer.hasRemaining()){
byte b3 = inputByteBuffer.get();
if (b3 == 94){//^
String field = new String(fieldBytes, 0, index, encoding);
if(fieldIndex == -1){
fields = new String[sizeFromAConfiguration];
}else{
fields[fieldIndex] = field;
}
fieldBytes = new byte[maxFieldSize];
endOfField = true;
fieldIndex++;
}
else{
fieldBytes = addFieldBytes(fieldBytes, b, index);
index++;
fieldBytes = addFieldBytes(fieldBytes, b2, index);
index++;
fieldBytes = addFieldBytes(fieldBytes, b3, index);
}
}
else{
endOfFile = true;
//fields.add(new String(fieldBytes, 0, index, encoding));
fields[fieldIndex] = new String(fieldBytes, 0, index, encoding);
fieldBytes = new byte[maxFieldSize];
endOfField = true;
}
}else{
fieldBytes = addFieldBytes(fieldBytes, b, index);
index++;
fieldBytes = addFieldBytes(fieldBytes, b2, index);
}
}else{
endOfFile = true;
fieldBytes = addFieldBytes(fieldBytes, b, index);
}
}
else{
fieldBytes = addFieldBytes(fieldBytes, b, index);
}
}
if (!inputByteBuffer.hasRemaining()){
inputByteBuffer.clear();
noOfBytes = inputFileChannel.read(inputByteBuffer);
inputByteBuffer.flip();
}
if (endOfField){
index = 0;
}
else{
index++;
}
}

You're causing a lot of overhead with the constant hasRemaining()/read() checks as well as the constant get() calls. It would probably be better to get() the entire buffer into an array and process that directly, only calling read() when you get to the end.
And to answer a question in comments, you should not allocate a new ByteBuffer per read. This is expensive. Keep using the same one. And NB do not use a DirectByteBuffer for this application. It is not appropriate: it's only appropriate when you want the data to stay south of the JVM/JNI boundary, e.g. when merely copying between channels.
But I think I would throw this away, or rather rewrite it, using BufferedReader.read(), rather than readLine() followed by string splits, and using much the same logic as you have here, except of course that you don't need to keep calling hasRemaining() and filling the buffer, which BufferedReader will do automatically for you.
You have to take care to store the result of read() into an int, and to check it for -1 after every read().
It isn't clear to me that you should be using a Reader at all actually, unless you know you have multibyte text. Possibly a simple BufferedInputStream would be more appropriate.

While one cannot tell with certainty how a particular code will behave I would imagine the best way is to profile it just like you did.The FileChannel while percieved to be faster is actually not helping in your case.But this may not be because of reading from the file but actual processing that you do with the content you read.
One article I would like to point out while dealing with files is
https://www.redgreencode.com/why-is-java-io-slow/
Also the corresponding Github codebase
Java IO benchmark
I would like to point out this code to use a combination of both worlds
fos = new FileOutputStream(outputFile);
outFileChannel = fos.getChannel();
bufferedWriter = new BufferedWriter(Channels.newWriter(outFileChannel, "UTF-8"));
Since it is read in your case I will consider
File inputFile = new File("C:\\input.txt");
FileInputStream fis = new FileInputStream(inputFile);
FileChannel inputChannel = fis.getChannel();
BufferedReader bufferedReader = new BufferedReader(Channels.newReader(inputChannel,"UTF-8"));
Also I will tweak the chunksize and with Spring batch it is always trial and error to find sweet spot.
On a completely unrelated note the reason for your problem of not able to use BufferedReader is because of doubling of charecters and I am assuming this happens more commonly with ebcdic charecters.I will simply run a loop like this to identfy the troublemakers and eliminate at the source.
import java.io.UnsupportedEncodingException;
public class EbcdicConvertor {
public static void main(String[] args) throws UnsupportedEncodingException {
int index = 0;
for (int i = -127; i < 128; i++) {
byte[] b = new byte[1];
b[0] = (byte) i;
String cp037 = new String(b, "CP037");
if (cp037.getBytes().length == 2) {
index++;
System.out.println(i + "::" + cp037);
}
}
System.out.println(index);
}
}
The above answer is without testing my actual hypothesis.Here is an actual program to measure time.The results speak for themselves on a 200 MB file
import java.io.File;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;
public class ReadComplexDelimitedFile {
private static long total = 0;
private static final Pattern DELIMITER_PATTERN = Pattern.compile("\\^\\|\\^");
private void readFileUsingScanner() {
String s;
try (Scanner stdin = new Scanner(new File(this.getClass().getResource("input.txt").getPath()))) {
while (stdin.hasNextLine()) {
s = stdin.nextLine();
String[] fields = DELIMITER_PATTERN.split(s, 0);
total = total + fields.length;
}
} catch (Exception e) {
System.err.println("Error");
}
}
private void readFileUsingCustomBufferedReader() {
try (BufferedReader stdin = new BufferedReader(new FileReader(new File(this.getClass().getResource("input.txt").getPath())))) {
String s;
while ((s = stdin.readLine()) != null) {
String[] fields = DELIMITER_PATTERN.split(s, 0);
total += fields.length;
}
} catch (Exception e) {
System.err.println("Error");
}
}
private void readFileUsingBufferedReader() {
try (java.io.BufferedReader stdin = new java.io.BufferedReader(new FileReader(new File(this.getClass().getResource("input.txt").getPath())))) {
String s;
while ((s = stdin.readLine()) != null) {
String[] fields = DELIMITER_PATTERN.split(s, 0);
total += fields.length;
}
} catch (Exception e) {
System.err.println("Error");
}
}
private void readFileUsingBufferedReaderFileChannel() {
try (FileInputStream fis = new FileInputStream(this.getClass().getResource("input.txt").getPath())) {
try (FileChannel inputChannel = fis.getChannel()) {
try (BufferedReader stdin = new BufferedReader(Channels.newReader(inputChannel, "UTF-8"))) {
String s;
while ((s = stdin.readLine()) != null) {
String[] fields = DELIMITER_PATTERN.split(s, 0);
total = total + fields.length;
}
}
} catch (Exception e) {
System.err.println("Error");
}
} catch (Exception e) {
System.err.println("Error");
}
}
private void readFileUsingBufferedReaderByteFileChannel() {
try (FileInputStream fis = new FileInputStream(this.getClass().getResource("input.txt").getPath())) {
try (FileChannel inputChannel = fis.getChannel()) {
try (BufferedReader stdin = new BufferedReader(Channels.newReader(inputChannel, "UTF-8"))) {
int b;
StringBuilder sb = new StringBuilder();
while ((b = stdin.read()) != -1) {
if (b == 10) {
total = total + DELIMITER_PATTERN.split(sb, 0).length;
sb = new StringBuilder();
} else {
sb.append((char) b);
}
}
}
} catch (Exception e) {
e.printStackTrace();
}
} catch (Exception e) {
System.err.println("Error");
}
}
private void readFileUsingFileChannelStream() {
try (RandomAccessFile fis = new RandomAccessFile(new File(this.getClass().getResource("input.txt").getPath()), "r")) {
try (FileChannel inputChannel = fis.getChannel()) {
ByteBuffer byteBuffer = ByteBuffer.allocate(8192);
ByteBuffer recordBuffer = ByteBuffer.allocate(250);
int recordLength = 0;
while ((inputChannel.read(byteBuffer)) != -1) {
byte b;
byteBuffer.flip();
while (byteBuffer.hasRemaining() && (b = byteBuffer.get()) != -1) {
if (b == 10) {
recordBuffer.flip();
total = total + splitIntoFields(recordBuffer, recordLength);
recordBuffer.clear();
recordLength = 0;
} else {
++recordLength;
recordBuffer.put(b);
}
}
byteBuffer.clear();
}
}
} catch (Exception e) {
e.printStackTrace();
}
}
private int splitIntoFields(ByteBuffer recordBuffer, int recordLength) {
byte b;
String[] fields = new String[17];
int fieldCount = -1;
StringBuilder sb = new StringBuilder();
for (int i = 0; i < recordLength - 1; i++) {
b = recordBuffer.get(i);
if (b == 94 && recordBuffer.get(++i) == 124 && recordBuffer.get(++i) == 94) {
fields[++fieldCount] = sb.toString();
sb = new StringBuilder();
} else {
sb.append((char) b);
}
}
fields[++fieldCount] = sb.toString();
return fields.length;
}
public static void main(String args[]) {
//JVM wamrup
for (int i = 0; i < 100000; i++) {
total += i;
}
// We know scanner is slow-Still warming up
ReadComplexDelimitedFile readComplexDelimitedFile = new ReadComplexDelimitedFile();
List<Long> longList = new ArrayList<>(50);
for (int i = 0; i < 50; i++) {
total = 0;
long startTime = System.nanoTime();
readComplexDelimitedFile.readFileUsingScanner();
long stopTime = System.nanoTime();
long timeDifference = stopTime - startTime;
longList.add(timeDifference);
}
System.out.println("Time taken for readFileUsingScanner");
longList.forEach(System.out::println);
// Actual performance test starts here
longList = new ArrayList<>(10);
for (int i = 0; i < 10; i++) {
total = 0;
long startTime = System.nanoTime();
readComplexDelimitedFile.readFileUsingBufferedReaderFileChannel();
long stopTime = System.nanoTime();
long timeDifference = stopTime - startTime;
longList.add(timeDifference);
}
System.out.println("Time taken for readFileUsingBufferedReaderFileChannel");
longList.forEach(System.out::println);
longList.clear();
for (int i = 0; i < 10; i++) {
total = 0;
long startTime = System.nanoTime();
readComplexDelimitedFile.readFileUsingBufferedReader();
long stopTime = System.nanoTime();
long timeDifference = stopTime - startTime;
longList.add(timeDifference);
}
System.out.println("Time taken for readFileUsingBufferedReader");
longList.forEach(System.out::println);
longList.clear();
for (int i = 0; i < 10; i++) {
total = 0;
long startTime = System.nanoTime();
readComplexDelimitedFile.readFileUsingCustomBufferedReader();
long stopTime = System.nanoTime();
long timeDifference = stopTime - startTime;
longList.add(timeDifference);
}
System.out.println("Time taken for readFileUsingCustomBufferedReader");
longList.forEach(System.out::println);
longList.clear();
for (int i = 0; i < 10; i++) {
total = 0;
long startTime = System.nanoTime();
readComplexDelimitedFile.readFileUsingBufferedReaderByteFileChannel();
long stopTime = System.nanoTime();
long timeDifference = stopTime - startTime;
longList.add(timeDifference);
}
System.out.println("Time taken for readFileUsingBufferedReaderByteFileChannel");
longList.forEach(System.out::println);
longList.clear();
for (int i = 0; i < 10; i++) {
total = 0;
long startTime = System.nanoTime();
readComplexDelimitedFile.readFileUsingFileChannelStream();
long stopTime = System.nanoTime();
long timeDifference = stopTime - startTime;
longList.add(timeDifference);
}
System.out.println("Time taken for readFileUsingFileChannelStream");
longList.forEach(System.out::println);
}
}
BufferedReader was written very long back and hence we can rewrite some parts relevant to this example.For instance we don't care about \r and skipLF or skipCR or those kinds of stuff
We are going to read the file( no need for syncrhonized)
By extension no need for StringBuffer even otherwise StringBuilder can be used.Performance improvement immediately seen.
dangerous hack,remove synchronized and replace StringBuffer with StringBuilder don't use it without proper testing and not knowing what you are doing
public String readLine() throws IOException {
StringBuilder s = null;
int startChar;
bufferLoop:
for (; ; ) {
if (nextChar >= nChars)
fill();
if (nextChar >= nChars) { /* EOF */
if (s != null && s.length() > 0)
return s.toString();
else
return null;
}
boolean eol = false;
char c = 0;
int i;
/* Skip a leftover '\n', if necessary */
charLoop:
for (i = nextChar; i < nChars; i++) {
c = cb[i];
if (c == '\n') {
eol = true;
break charLoop;
}
}
startChar = nextChar;
nextChar = i;
if (eol) {
String str;
if (s == null) {
str = new String(cb, startChar, i - startChar);
} else {
s.append(cb, startChar, i - startChar);
str = s.toString();
}
nextChar++;
return str;
}
if (s == null)
s = new StringBuilder(defaultExpectedLineLength);
s.append(cb, startChar, i - startChar);
}
}
Java 8 Intel i5 12 GB RAM Windows 10
Result:
Time taken for readFileUsingBufferedReaderFileChannel::
2581635057 1849820885 1763992972 1770510738 1746444157 1733491399
1740530125 1723907177 1724280512 1732445638
Time taken for readFileUsingBufferedReader
1851027073 1775304769 1803507033 1789979554 1786974538 1802675458
1789672780 1798036307 1789847714 1785302003
Time taken for readFileUsingCustomBufferedReader
1745220476 1721039975 1715383650 1728548462 1724746005 1718177466
1738026017 1748077438 1724608192 1736294175
Time taken for readFileUsingBufferedReaderByteFileChannel
2872857919 2480237636 2917488143 2913491126 2880117231 2904614745
2911756298 2878777496 2892169722 2888091211
Time taken for readFileUsingFileChannelStream
3039447073 2896156498 2538389366 2906287280 2887612064 2929288046
2895626578 2955326255 2897535059 2884476915
Process finished with exit code 0

I did try NIO with all possible options(provided in this post and to the best of my knowledge and research) and found that it no where came close to BufferedReader in terms of reading a text file.
Changing BufferedReader to use StringBuilder in place of StringBuffer, I don't see any significant improvement in performance (only very few seconds for some files and some of them were better using StringBuffer itself).
Removing synchronized block also didn't give much/any improvement. And it's not worth to tweak something by which we didn't receive any benefit.
The below is the time taken(reading, processing, writing - time taken for processing and writing is not significant - not even 20% of time) for file which is around 50 GB
NIO : 71.67 (Minutes)
IO (BufferedReader) : 10.84 (Minutes)
Thank you all for your time to reading and responding to this post and providing suggestions.

The main issue here is creating a new byte[] very rapidly(fieldBytes = new byte[maxFieldSize];).
Since for every iteration a new array is being created, garbage collection is being kicked off very often which triggers "stop the world" to reclaim the memory.
And also, the object creation could be expensive.
We could rather initialize the byte array once and then track the indexes to just convert the field to string with an end index.
And anyway, BufferedReader is faster than FileChannel, atleast to read the ASCII files, and to keep the code simple, we continued using Bufferred Reader itself.
Using Bufferred reader, the development and testing effort can be reduced by not having tedious logic to find delimiters and populating the object.

Java: Read up to x chars from a file into array

I want to read a text file and store its contents in an array where each element of the array holds up to 500 characters from the file (i.e. keep reading 500 characters at a time until there are no more characters to read).
I'm having trouble doing this because I'm having trouble understanding the difference between all of the different ways to do IO in Java and I can't find any that performs the task I want.
And will I need to use an array list since I don't initially know how many items are in the array?

It would be hard to avoid using ArrayList or something similar. If you know the file is ASCII, you could do
int partSize = 500;
File f = new File("file.txt");
String[] parts = new String[(f.length() + partSize - 1) / partSize];
But if the file uses a variable-width encoding like UTF-8, this won't work. This code will do the job.
static String[] readFileInParts(String fname) throws IOException {
int partSize = 500;
FileReader fr = new FileReader(fname);
List<String> parts = new ArrayList<String>();
char[] buf = new char[partSize];
int pos = 0;
for (;;) {
int nRead = fr.read(buf, pos, partSize - pos);
if (nRead == -1) {
if (pos > 0)
parts.add(new String(buf, 0, pos));
break;
}
pos += nRead;
if (pos == partSize) {
parts.add(new String(buf));
pos = 0;
}
}
return parts.toArray(new String[parts.size()]);
}
Note that FileReader uses the platform default encoding. To specify a specific encoding, replace it with new InputStreamReader(new FileInputStream(fname), charSet). It bit ugly, but that's the best way to do it.

An ArrayList will definitely be more suitable as you don't know how many elements you're going to have.
There are many ways to read a file, but as you want to keep the count of characters to get 500 of them, you could use the read() method of the Reader object that will read character by character. Once you collected the 500 characters you need (in a String I guess), just add it to your ArrayList (all of that in a loop of course).
The Reader object needs to be initialized with an object that extends Reader, like an InputStreamReader (this one take an implementation of an InputStream as parameter, a FileInputStream when working with a file as input).

Not sure if this will work, but you might want to try something like this (Caution: untested code):
private void doStuff() {
ArrayList<String> stringList = new ArrayList<String>();
BufferedReader in = null;
try {
in = new BufferedReader(new FileReader("file.txt"));
String str;
int count = 0;
while ((str = in.readLine()) != null) {
String temp = "";
for (int i = 0; i <= str.length(); i++) {
temp += str.charAt(i);
count++;
if(count>500) {
stringList.add(temp);
temp = "";
count = 0;
}
}
if(count>500) {
stringList.add(temp);
temp = "";
count = 0;
}
}
} catch (IOException e) {
// handle
} finally {
try {
in.close();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}

How to access the same file in two different places in Java

I want to read from file from two different places concurrently. I also want to use buffered i/o stream for efficiency. I tried to work out sth on my own given java API, but it's not working. Anybody will help? I need it for external merge-sort. Thanks for help!

You need to create a RandomAccessFile, which is basically Java's equivalent of C's memory mapped file.
I found an example of this:
try {
File file = new File("filename");
// Create a read-only memory-mapped file
FileChannel roChannel = new RandomAccessFile(file, "r").getChannel();
ByteBuffer roBuf = roChannel.map(FileChannel.MapMode.READ_ONLY, 0, (int)roChannel.size());
// Create a read-write memory-mapped file
FileChannel rwChannel = new RandomAccessFile(file, "rw").getChannel();
ByteBuffer wrBuf = rwChannel.map(FileChannel.MapMode.READ_WRITE, 0, (int)rwChannel.size());
// Create a private (copy-on-write) memory-mapped file.
// Any write to this channel results in a private copy of the data.
FileChannel pvChannel = new RandomAccessFile(file, "rw").getChannel();
ByteBuffer pvBuf = roChannel.map(FileChannel.MapMode.READ_WRITE, 0, (int)rwChannel.size());
} catch (IOException e) {
}
Edit, you stated you can't use a RandomAccessFile, which is the only way to skip up and down through the file. If you're stuck without it, then you must read the file sequentially, but that doesn't mean that you can't open multiple pointers to the same file for reading.
I put together the following test/sample and it shows clearly that you can open the file "twice" with different read pointers and sequentially sum two halves of the file. Again, if you need random access, you must use a RandomAccessFile, and that's what I'd suggest, but here you go:
public class FileTest {
public static void main(String[] args) throws IOException, InterruptedException, ExecutionException{
File temp = File.createTempFile("asfd", "");
BufferedWriter wrt = new BufferedWriter(new FileWriter(temp));
int testLength = 10000;
int numWidth = String.valueOf(testLength).length();
int targetSum = 0;
for(int i = 0; i < testLength; i++){
// each line guaranteed to have a good number of characters for our test
wrt.write(String.format("%0"+ numWidth +"d\n", i));
targetSum += i;
}
wrt.close();
BufferedReader rdr1 = new BufferedReader(new FileReader(temp));
BufferedReader rdr2 = new BufferedReader(new FileReader(temp));
rdr2.skip((numWidth+1)*testLength / 2); // skip first half of the lines
Summer sum1 = new Summer(rdr1, testLength / 2);
Summer sum2 = new Summer(rdr2, testLength / 2);
ExecutorService executor = Executors.newFixedThreadPool(2);
Future<Integer> halfSum1 = executor.submit(sum1);
Future<Integer> halfSum2 = executor.submit(sum2);
System.out.println("Total sum = " + (halfSum1.get() + halfSum2.get()) + " reference " + targetSum);
rdr1.close();
rdr2.close();
temp.delete();
}
private static class Summer implements Callable<Integer>{
private BufferedReader rdr;
private int limit;
public Summer(BufferedReader rdr, int limit) throws IOException{
this.rdr = rdr;
this.limit = limit;
}
#Override
public Integer call() throws Exception {
System.out.println(Thread.currentThread().getName() + " started " + System.currentTimeMillis());
int sum = 0;
for(int i = 0; i < limit; i++){
sum += Integer.valueOf(rdr.readLine());
// uncomment to see interleaving of threads:
//System.out.println(Thread.currentThread().getName());
}
System.out.println(Thread.currentThread().getName() + " finished " + System.currentTimeMillis());
return sum;
}
}
}

What's to stop you from simply opening the file twice, and working with it as if it were two independent files?
File inputFile = new File("src/SameFileTwice.java");
BufferedReader in1 = new BufferedReader(new InputStreamReader(new FileInputStream(inputFile)));
BufferedReader in2 = new BufferedReader(new InputStreamReader(new FileInputStream(inputFile)));
try {
String strLine;
while ((strLine = in1.readLine()) != null && (strLine = in2.readLine()) != null) {
System.out.println(strLine);
}
} finally {
in1.close();
in2.close();
}

How can I get the count of line in a file in an efficient way? [duplicate]

This question already has answers here:
Number of lines in a file in Java
(19 answers)
Closed 6 years ago.
I have a big file. It includes approximately 3.000-20.000 lines. How can I get the total count of lines in the file using Java?

BufferedReader reader = new BufferedReader(new FileReader("file.txt"));
int lines = 0;
while (reader.readLine() != null) lines++;
reader.close();
Update: To answer the performance-question raised here, I made a measurement. First thing: 20.000 lines are too few, to get the program running for a noticeable time. I created a text-file with 5 million lines. This solution (started with java without parameters like -server or -XX-options) needed around 11 seconds on my box. The same with wc -l (UNIX command-line-tool to count lines), 11 seconds. The solution reading every single character and looking for '\n' needed 104 seconds, 9-10 times as much.

Files.lines
Java 8+ has a nice and short way using NIO using Files.lines. Note that you have to close the stream using try-with-resources:
long lineCount;
try (Stream<String> stream = Files.lines(path, StandardCharsets.UTF_8)) {
lineCount = stream.count();
}
If you don't specify the character encoding, the default one used is UTF-8. You may specify an alternate encoding to match your particular data file as shown in the example above.

use LineNumberReader
something like
public static int countLines(File aFile) throws IOException {
LineNumberReader reader = null;
try {
reader = new LineNumberReader(new FileReader(aFile));
while ((reader.readLine()) != null);
return reader.getLineNumber();
} catch (Exception ex) {
return -1;
} finally {
if(reader != null)
reader.close();
}
}

I found some solution for this, it might useful for you
Below is the code snippet for, count the no.of lines from the file.
File file = new File("/mnt/sdcard/abc.txt");
LineNumberReader lineNumberReader = new LineNumberReader(new FileReader(file));
lineNumberReader.skip(Long.MAX_VALUE);
int lines = lineNumberReader.getLineNumber();
lineNumberReader.close();

Read the file through and count the number of newline characters. An easy way to read a file in Java, one line at a time, is the java.util.Scanner class.

This is about as efficient as it can get, buffered binary read, no string conversion,
FileInputStream stream = new FileInputStream("/tmp/test.txt");
byte[] buffer = new byte[8192];
int count = 0;
int n;
while ((n = stream.read(buffer)) > 0) {
for (int i = 0; i < n; i++) {
if (buffer[i] == '\n') count++;
}
}
stream.close();
System.out.println("Number of lines: " + count);

Do You need exact number of lines or only its approximation? I happen to process large files in parallel and often I don't need to know exact count of lines - I then revert to sampling. Split the file into ten 1MB chunks and count lines in each chunk, then multiply it by 10 and You'll receive pretty good approximation of line count.

All previous answers suggest to read though the whole file and count the amount of newlines you find while doing this. You commented some as "not effective" but thats the only way you can do that. A "line" is nothing else as a simple character inside the file. And to count that character you must have a look at every single character within the file.
I'm sorry, but you have no choice. :-)

This solution is about 3.6× faster than the top rated answer when tested on a file with 13.8 million lines. It simply reads the bytes into a buffer and counts the \n characters. You could play with the buffer size, but on my machine, anything above 8KB didn't make the code faster.
private int countLines(File file) throws IOException {
int lines = 0;
FileInputStream fis = new FileInputStream(file);
byte[] buffer = new byte[BUFFER_SIZE]; // BUFFER_SIZE = 8 * 1024
int read;
while ((read = fis.read(buffer)) != -1) {
for (int i = 0; i < read; i++) {
if (buffer[i] == '\n') lines++;
}
}
fis.close();
return lines;
}

If the already posted answers aren't fast enough you'll probably have to look for a solution specific to your particular problem.
For example if these text files are logs that are only appended to and you regularly need to know the number of lines in them you could create an index. This index would contain the number of lines in the file, when the file was last modified and how large the file was then. This would allow you to recalculate the number of lines in the file by skipping over all the lines you had already seen and just reading the new lines.

Old post, but I have a solution that could be usefull for next people.
Why not just use file length to know what is the progression? Of course, lines has to be almost the same size, but it works very well for big files:
public static void main(String[] args) throws IOException {
File file = new File("yourfilehere");
double fileSize = file.length();
System.out.println("=======> File size = " + fileSize);
InputStream inputStream = new FileInputStream(file);
InputStreamReader inputStreamReader = new InputStreamReader(inputStream, "iso-8859-1");
BufferedReader bufferedReader = new BufferedReader(inputStreamReader);
int totalRead = 0;
try {
while (bufferedReader.ready()) {
String line = bufferedReader.readLine();
// LINE PROCESSING HERE
totalRead += line.length() + 1; // we add +1 byte for the newline char.
System.out.println("Progress ===> " + ((totalRead / fileSize) * 100) + " %");
}
} finally {
bufferedReader.close();
}
}
It allows to see the progression without doing any full read on the file. I know it depends on lot of elements, but I hope it will be usefull :).
[Edition]
Here is a version with estimated time. I put some SYSO to show progress and estimation. I see that you have a good time estimation errors after you have treated enough line (I try with 10M lines, and after 1% of the treatment, the time estimation was exact at 95%).
I know, some values has to be set in variable. This code is quickly written but has be usefull for me. Hope it will be for you too :).
long startProcessLine = System.currentTimeMillis();
int totalRead = 0;
long progressTime = 0;
double percent = 0;
int i = 0;
int j = 0;
int fullEstimation = 0;
try {
while (bufferedReader.ready()) {
String line = bufferedReader.readLine();
totalRead += line.length() + 1;
progressTime = System.currentTimeMillis() - startProcessLine;
percent = (double) totalRead / fileSize * 100;
if ((percent > 1) && i % 10000 == 0) {
int estimation = (int) ((progressTime / percent) * (100 - percent));
fullEstimation += progressTime + estimation;
j++;
System.out.print("Progress ===> " + percent + " %");
System.out.print(" - current progress : " + (progressTime) + " milliseconds");
System.out.print(" - Will be finished in ===> " + estimation + " milliseconds");
System.out.println(" - estimated full time => " + (progressTime + estimation));
}
i++;
}
} finally {
bufferedReader.close();
}
System.out.println("Ended in " + (progressTime) + " seconds");
System.out.println("Estimative average ===> " + (fullEstimation / j));
System.out.println("Difference: " + ((((double) 100 / (double) progressTime)) * (progressTime - (fullEstimation / j))) + "%");
Feel free to improve this code if you think it's a good solution.

Quick and dirty, but it does the job:
import java.io.*;
public class Counter {
public final static void main(String[] args) throws IOException {
if (args.length > 0) {
File file = new File(args[0]);
System.out.println(countLines(file));
}
}
public final static int countLines(File file) throws IOException {
ProcessBuilder builder = new ProcessBuilder("wc", "-l", file.getAbsolutePath());
Process process = builder.start();
InputStream in = process.getInputStream();
LineNumberReader reader = new LineNumberReader(new InputStreamReader(in));
String line = reader.readLine();
if (line != null) {
return Integer.parseInt(line.trim().split(" ")[0]);
} else {
return -1;
}
}
}

Read the file line by line and increment a counter for each line until you have read the entire file.

Try the unix "wc" command. I don't mean use it, I mean download the source and see how they do it. It's probably in c, but you can easily port the behavior to java. The problem with making your own is to account for the ending cr/lf problem.

The buffered reader is overkill
Reader r = new FileReader("f.txt");
int count = 0;
int nextchar = 0;
while (nextchar != -1){
nextchar = r.read();
if (nextchar == Character.getNumericValue('\n') ){
count++;
}
}
My search for a simple example has createde one thats actually quite poor. calling read() repeadedly for a single character is less than optimal. see here for examples and measurements.

CharBuffer vs. char[]

Is there any reason to prefer a CharBuffer to a char[] in the following:
CharBuffer buf = CharBuffer.allocate(DEFAULT_BUFFER_SIZE);
while( in.read(buf) >= 0 ) {
out.append( buf.flip() );
buf.clear();
}
vs.
char[] buf = new char[DEFAULT_BUFFER_SIZE];
int n;
while( (n = in.read(buf)) >= 0 ) {
out.write( buf, 0, n );
}
(where in is a Reader and out in a Writer)?

No, there's really no reason to prefer a CharBuffer in this case.
In general, though, CharBuffer (and ByteBuffer) can really simplify APIs and encourage correct processing. If you were designing a public API, it's definitely worth considering a buffer-oriented API.

I wanted to mini-benchmark this comparison.
Below is the class I have written.
The thing is I can't believe that the CharBuffer performed so badly. What have I got wrong?
EDIT: Since the 11th comment below I have edited the code and the output time, better performance all round but still a significant difference in times. I also tried out2.append((CharBuffer)buff.flip()) option mentioned in the comments but it was much slower than the write option used in the code below.
Results: (time in ms)
char[] : 3411
CharBuffer: 5653
public class CharBufferScratchBox
{
public static void main(String[] args) throws Exception
{
// Some Setup Stuff
String smallString =
"1111111111222222222233333333334444444444555555555566666666667777777777888888888899999999990000000000";
StringBuilder stringBuilder = new StringBuilder();
for (int i = 0; i < 1000; i++)
{
stringBuilder.append(smallString);
}
String string = stringBuilder.toString();
int DEFAULT_BUFFER_SIZE = 1000;
int ITTERATIONS = 10000;
// char[]
StringReader in1 = null;
StringWriter out1 = null;
Date start = new Date();
for (int i = 0; i < ITTERATIONS; i++)
{
in1 = new StringReader(string);
out1 = new StringWriter(string.length());
char[] buf = new char[DEFAULT_BUFFER_SIZE];
int n;
while ((n = in1.read(buf)) >= 0)
{
out1.write(
buf,
0,
n);
}
}
Date done = new Date();
System.out.println("char[] : " + (done.getTime() - start.getTime()));
// CharBuffer
StringReader in2 = null;
StringWriter out2 = null;
start = new Date();
CharBuffer buff = CharBuffer.allocate(DEFAULT_BUFFER_SIZE);
for (int i = 0; i < ITTERATIONS; i++)
{
in2 = new StringReader(string);
out2 = new StringWriter(string.length());
int n;
while ((n = in2.read(buff)) >= 0)
{
out2.write(
buff.array(),
0,
n);
buff.clear();
}
}
done = new Date();
System.out.println("CharBuffer: " + (done.getTime() - start.getTime()));
}
}

If this is the only thing you're doing with the buffer, then the array is probably the better choice in this instance.
CharBuffer has lots of extra chrome on it, but none of it is relevant in this case - and will only slow things down a fraction.
You can always refactor later if you need to make things more complicated.

The difference, in practice, is actually <10%, not 30% as others are reporting.
To read and write a 5MB file 24 times, my numbers taken using a Profiler. They were on average:
char[] = 4139 ms
CharBuffer = 4466 ms
ByteBuffer = 938 (direct) ms
Individual tests a couple times favored CharBuffer.
I also tried replacing the File-based IO with In-Memory IO and the performance was similar. If you are trying to transfer from one native stream to another, then you are better off using a "direct" ByteBuffer.
With less than 10% performance difference, in practice, I would favor the CharBuffer. It's syntax is clearer, there's less extraneous variables, and you can do more direct manipulation on it (i.e. anything that asks for a CharSequence).
Benchmark is below... it is slightly wrong as the BufferedReader is allocated inside the test-method rather than outside... however, the example below allows you to isolate the IO time and eliminate factors like a string or byte stream resizing its internal memory buffer, etc.
public static void main(String[] args) throws Exception {
File f = getBytes(5000000);
System.out.println(f.getAbsolutePath());
try {
System.gc();
List<Main> impls = new java.util.ArrayList<Main>();
impls.add(new CharArrayImpl());
//impls.add(new CharArrayNoBuffImpl());
impls.add(new CharBufferImpl());
//impls.add(new CharBufferNoBuffImpl());
impls.add(new ByteBufferDirectImpl());
//impls.add(new CharBufferDirectImpl());
for (int i = 0; i < 25; i++) {
for (Main impl : impls) {
test(f, impl);
}
System.out.println("-----");
if(i==0)
continue; //reset profiler
}
System.gc();
System.out.println("Finished");
return;
} finally {
f.delete();
}
}
static int BUFFER_SIZE = 1000;
static File getBytes(int size) throws IOException {
File f = File.createTempFile("input", ".txt");
FileWriter writer = new FileWriter(f);
Random r = new Random();
for (int i = 0; i < size; i++) {
writer.write(Integer.toString(5));
}
writer.close();
return f;
}
static void test(File f, Main impl) throws IOException {
InputStream in = new FileInputStream(f);
File fout = File.createTempFile("output", ".txt");
try {
OutputStream out = new FileOutputStream(fout, false);
try {
long start = System.currentTimeMillis();
impl.runTest(in, out);
long end = System.currentTimeMillis();
System.out.println(impl.getClass().getName() + " = " + (end - start) + "ms");
} finally {
out.close();
}
} finally {
fout.delete();
in.close();
}
}
public abstract void runTest(InputStream ins, OutputStream outs) throws IOException;
public static class CharArrayImpl extends Main {
char[] buff = new char[BUFFER_SIZE];
public void runTest(InputStream ins, OutputStream outs) throws IOException {
Reader in = new BufferedReader(new InputStreamReader(ins));
Writer out = new BufferedWriter(new OutputStreamWriter(outs));
int n;
while ((n = in.read(buff)) >= 0) {
out.write(buff, 0, n);
}
}
}
public static class CharBufferImpl extends Main {
CharBuffer buff = CharBuffer.allocate(BUFFER_SIZE);
public void runTest(InputStream ins, OutputStream outs) throws IOException {
Reader in = new BufferedReader(new InputStreamReader(ins));
Writer out = new BufferedWriter(new OutputStreamWriter(outs));
int n;
while ((n = in.read(buff)) >= 0) {
buff.flip();
out.append(buff);
buff.clear();
}
}
}
public static class ByteBufferDirectImpl extends Main {
ByteBuffer buff = ByteBuffer.allocateDirect(BUFFER_SIZE * 2);
public void runTest(InputStream ins, OutputStream outs) throws IOException {
ReadableByteChannel in = Channels.newChannel(ins);
WritableByteChannel out = Channels.newChannel(outs);
int n;
while ((n = in.read(buff)) >= 0) {
buff.flip();
out.write(buff);
buff.clear();
}
}
}

I think that CharBuffer and ByteBuffer (as well as any other xBuffer) were meant for reusability so you can buf.clear() them instead of going through reallocation every time
If you don't reuse them, you're not using their full potential and it will add extra overhead. However if you're planning on scaling this function this might be a good idea to keep them there

You should avoid CharBuffer in recent Java versions, there is a bug in #subsequence(). You cannot get a subsequence from the second half of the buffer since the implementation confuses capacity and remaining. I observed the bug in java 6-0-11 and 6-0-12.

The CharBuffer version is slightly less complicated (one less variable), encapsulates buffer size handling and makes use of a standard API. Generally I would prefer this.
However there is still one good reason to prefer the array version, in some cases at least. CharBuffer was only introduced in Java 1.4 so if you are deploying to an earlier version you can't use Charbuffer (unless you roll-your-own/use a backport).
P.S If you use a backport remember to remove it once you catch up to the version containing the "real" version of the backported code.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Split Large File Into Smaller Files Using Parallel Stream in Java - java

Related

Java FileChannel Vs BufferedReader - Spring Batch - Reader

Java: Read up to x chars from a file into array

How to access the same file in two different places in Java

How can I get the count of line in a file in an efficient way? [duplicate]

CharBuffer vs. char[]

Categories

Resources