How to speed up external merge sort in Java

I am writing code for an external merge sort. The idea is that the input file contains too many numbers to fit in an array, so it is sorted in chunks that are written back to disk and then merged. Here's my code. While it runs fast, it is not fast enough, and I was wondering whether you can think of any improvements I can make. Note that I first sort every 1M integers in memory, so the earliest iterations of the merging algorithm are skipped.
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;
public class ExternalSort {
public static void sort(String f1, String f2) throws Exception {
RandomAccessFile raf1 = new RandomAccessFile(f1, "rw");
RandomAccessFile raf2 = new RandomAccessFile(f2, "rw");
int fileByteSize = (int) (raf1.length() / 4);
int size = Math.min(1000000, fileByteSize);
externalSort(f1, f2, size);
boolean writeToOriginal = true;
DataOutputStream dos;
while (size <= fileByteSize) {
if (writeToOriginal) {
raf1.seek(0);
dos = new DataOutputStream(new BufferedOutputStream(
new MyFileOutputStream(raf1.getFD())));
} else {
raf2.seek(0);
dos = new DataOutputStream(new BufferedOutputStream(
new MyFileOutputStream(raf2.getFD())));
}
for (int i = 0; i < fileByteSize; i += 2 * size) {
if (writeToOriginal) {
dos = merge(f2, dos, i, size);
} else {
dos = merge(f1, dos, i, size);
}
}
dos.flush();
writeToOriginal = !writeToOriginal;
size *= 2;
}
if (writeToOriginal)
{
raf1.seek(0);
raf2.seek(0);
dos = new DataOutputStream(new BufferedOutputStream(
new MyFileOutputStream(raf1.getFD())));
int i = 0;
while (i < raf2.length() / 4){
dos.writeInt(raf2.readInt());
i++;
}
dos.flush();
}
}
public static void externalSort(String f1, String f2, int size) throws Exception{
RandomAccessFile raf1 = new RandomAccessFile(f1, "rw");
RandomAccessFile raf2 = new RandomAccessFile(f2, "rw");
int fileByteSize = (int) (raf1.length() / 4);
int[] array = new int[size];
DataInputStream dis = new DataInputStream(new BufferedInputStream(
new MyFileInputStream(raf1.getFD())));
DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(
new MyFileOutputStream(raf2.getFD())));
int count = 0;
while (count < fileByteSize){
for (int k = 0; k < size; ++k){
array[k] = dis.readInt();
}
count += size;
Arrays.sort(array);
for (int k = 0; k < size; ++k){
dos.writeInt(array[k]);
}
}
dos.flush();
raf1.close();
raf2.close();
dis.close();
dos.close();
}
public static DataOutputStream merge(String file,
DataOutputStream dos, int start, int size) throws IOException {
RandomAccessFile raf = new RandomAccessFile(file, "rw");
RandomAccessFile raf2 = new RandomAccessFile(file, "rw");
int fileByteSize = (int) (raf.length() / 4);
raf.seek(4 * start);
raf2.seek(4 *start);
DataInputStream dis = new DataInputStream(new BufferedInputStream(
new MyFileInputStream(raf.getFD())));
DataInputStream dis3 = new DataInputStream(new BufferedInputStream(
new MyFileInputStream(raf2.getFD())));
int i = 0;
int j = 0;
int max = size * 2;
int a = dis.readInt();
int b;
if (start + size < fileByteSize) {
dis3.skip(4 * size);
b = dis3.readInt();
} else {
b = Integer.MAX_VALUE;
j = size;
}
while (i + j < max) {
if (j == size || (a <= b && i != size)) {
dos.writeInt(a);
i++;
if (start + i == fileByteSize) {
i = size;
} else if (i != size) {
a = dis.readInt();
}
} else {
dos.writeInt(b);
j++;
if (start + size + j == fileByteSize) {
j = size;
} else if (j != size) {
b = dis3.readInt();
}
}
}
raf.close();
raf2.close();
return dos;
}
public static void main(String[] args) throws Exception {
String f1 = args[0];
String f2 = args[1];
sort(f1, f2);
}
}

You might wish to merge k > 2 segments at a time. This reduces the amount of I/O from roughly n·log(n)/log(2) to n·log(n)/log(k), since fewer merge passes over the whole file are needed.
Edit: In pseudocode, this would look something like this:
void sort(List list) {
if (list fits in memory) {
list.sort();
} else {
sublists = partition list into k about equally big sublists
for (sublist : sublists) {
sort(sublist);
}
merge(sublists);
}
}
void merge(List[] sortedsublists) {
keep a pointer in each sublist, which initially points to its first element
do {
find the pointer pointing at the smallest element
add the element it points to to the result list
advance that pointer
} until all pointers have reached the end of their sublist
return the result list
}
To efficiently find the "smallest" pointer, you might employ a PriorityQueue.
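For the integer runs in this question, such a k-way merge could look roughly like the sketch below. It assumes the caller opens one DataInputStream per sorted run and knows how many ints each run holds; the runs/runLengths names are purely illustrative, and the heap is a java.util.PriorityQueue keyed on the current head value of each run.
static void kWayMerge(DataInputStream[] runs, long[] runLengths,
                      DataOutputStream out) throws IOException {
    // Each queue entry is {value, runIndex}, ordered by value.
    PriorityQueue<long[]> heap = new PriorityQueue<>((x, y) -> Long.compare(x[0], y[0]));
    long[] remaining = runLengths.clone();
    for (int r = 0; r < runs.length; r++) {
        if (remaining[r] > 0) {
            heap.add(new long[] { runs[r].readInt(), r });
            remaining[r]--;
        }
    }
    while (!heap.isEmpty()) {
        long[] head = heap.poll();
        out.writeInt((int) head[0]);
        int r = (int) head[1];
        if (remaining[r] > 0) {                 // refill from the run we just consumed
            heap.add(new long[] { runs[r].readInt(), r });
            remaining[r]--;
        }
    }
    out.flush();
}
With all k runs open at once, the file is merged in a single pass instead of log2(fileSize/runSize) passes, at the cost of k open streams and their buffers.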

I would use memory-mapped files. They can be as much as 10x faster than this type of IO, and I suspect they would be much faster in this case as well. Mapped buffers use virtual memory rather than heap space to store data and can be larger than your available physical memory.
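As a rough sketch of how that could look for the raw big-endian ints this question writes with DataOutputStream (the file name here is made up), one run can be sorted in place through a mapping:
// Sketch only: sort one run of raw big-endian ints via a memory-mapped file.
try (RandomAccessFile raf = new RandomAccessFile("run.bin", "rw");
     FileChannel ch = raf.getChannel()) {
    MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, 0, ch.size());
    IntBuffer ints = map.asIntBuffer();   // big-endian view, matches DataOutputStream
    int[] chunk = new int[ints.remaining()];
    ints.get(chunk);                      // bulk read, no per-int stream calls
    Arrays.sort(chunk);
    ints.rewind();
    ints.put(chunk);                      // write the sorted values back in place
    map.force();                          // flush the mapping to disk
}
Note that a single mapping is limited to 2 GB, so a very large file would have to be mapped in windows.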

We have implemented a public domain external sort in Java:
http://code.google.com/p/externalsortinginjava/
It might be faster than yours. We use strings and not integers, but you could easily modify our code by substituting integers for strings (the code was made hackable by design). At the very least, you can compare with our design.
Looking at your code, it seems that you are reading the data one integer at a time, so I would guess that I/O is the bottleneck. With external-memory algorithms, you want to read and write blocks of data, especially in Java.
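For example, instead of calling dis.readInt() once per value, each run can be pulled in with one bulk read and decoded in memory. This sketch reuses the dis and size names from the question's externalSort:
byte[] raw = new byte[size * 4];              // one run: 'size' ints, 4 bytes each
dis.readFully(raw);                           // single bulk read per run
int[] run = new int[size];
ByteBuffer.wrap(raw).asIntBuffer().get(run);  // decode big-endian ints
Arrays.sort(run);
// (a last, shorter run would need the same special-casing the original loop lacks)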

You are sorting integers, so you should check out radix sort. The core idea is that n-byte integers can be sorted with n passes through the data using radix 256, so 32-bit ints need only four counting passes.
You can combine this with the merge-sort approach: radix-sort each in-memory run, then merge the runs as before.
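A minimal sketch of an LSD radix-256 sort for the in-memory runs (sort each 1M-int chunk this way, then merge the runs as before); the XOR of the top byte with 0x80 keeps negative ints ordered before positive ones:
// Sketch only: LSD radix sort, radix 256, four passes over 32-bit ints.
static void radixSort(int[] a) {
    int[] tmp = new int[a.length];
    for (int shift = 0; shift < 32; shift += 8) {
        int[] count = new int[257];
        for (int v : a) {
            int digit = (v >>> shift) & 0xFF;
            if (shift == 24) digit ^= 0x80;      // flip sign bit so negatives sort first
            count[digit + 1]++;
        }
        for (int i = 0; i < 256; i++) count[i + 1] += count[i];  // prefix sums
        for (int v : a) {
            int digit = (v >>> shift) & 0xFF;
            if (shift == 24) digit ^= 0x80;
            tmp[count[digit]++] = v;             // stable scatter into the target slots
        }
        System.arraycopy(tmp, 0, a, 0, a.length);
    }
}
Four counting passes replace the comparisons done by Arrays.sort, which can pay off for large chunks, though it is worth benchmarking against the JDK's tuned dual-pivot quicksort before committing to it.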

Related

How to read a file character-by-character in reverse without running out of memory?

The Story
I've been having a problem lately...
I have to read a file in reverse character by character without running out of memory.
I can't read it line-by-line and reverse it with a StringBuilder because it's a one-line file that takes up to a gigabyte on disk.
Hence it would take up too much of the JVM's (and the system's) memory.
I've decided to just read it character by character from end-to-start (back-to-front) so that I could process as much as I can without consuming too much memory.
What I've Tried
I know how to read a file in one go:
(MappedByteBuffer + FileChannel + Charset, which gave me OutOfMemoryErrors)
and read a file character-by-character with UTF-8 character support
(FileInputStream+InputStreamReader).
The problem is that FileInputStream's #read() only calls #read0() which is a native method!
Because of that I have no idea about the underlying code...
Which is why I'm here today (or at least until this is done)!
This will do it (but as written it is not very efficient): skip to one byte before the last location read, read and print that character, then reset the position to the mark, adjust the remaining size, and continue.
File f = new File("Some File name");
int size = (int) f.length();
int bsize = 1;
byte[] buf = new byte[bsize];
try (BufferedInputStream b =
        new BufferedInputStream(new FileInputStream(f))) {
    while (size > 0) {
        b.mark(size);
        b.skip(size - bsize);          // jump to the last byte not yet printed
        int k = b.read(buf);
        System.out.print((char) buf[0]);
        size -= k;
        b.reset();                     // back to the mark for the next pass
    }
} catch (IOException ioe) {
    ioe.printStackTrace();
}
This could be improved by increasing the buffer size and making equivalent adjustments in the mark and skip arguments.
Updated Version
I wasn't fully satisfied with my answer so I made it more general. Some variables could have served double duty but using meaningful names helps clarify how they are used.
Mark must be used so reset can be used. However, it only needs to be set once and is set to position 0 outside of the main loop. I do not know if marking closer to the read point is more efficient or not.
skipCnt - initially set to fileSize, it is the number of bytes to skip before reading. If the number of bytes remaining is greater than the buffer size, then the skip count will be skipCnt - bsize. Else it will be 0.
remainingBytes - a running total of how many bytes are still to be read. It is updated by subtracting the current readCnt.
readCnt - how many bytes to read. If remainingBytes is greater than bsize then set to bsize, else set to remainingBytes
The while loop continuously reads the file starting near the end and then prints the just read information in reverse order. All variables are updated and the process repeats until the remainingBytes reaches 0.
File f = new File("some file");
int bsize = 16;
int fileSize = (int)f.length();
int remainingBytes = fileSize;
int skipCnt = fileSize;
byte[] buf = new byte[bsize];
try (BufferedInputStream b =
new BufferedInputStream(new FileInputStream(f))) {
b.mark(0);
while(remainingBytes > 0) {
skipCnt = skipCnt > bsize ? skipCnt - bsize : 0;
b.skip(skipCnt);
int readCnt = remainingBytes > bsize ? bsize : remainingBytes;
b.read(buf,0,readCnt);
for (int i = readCnt-1; i >= 0; i--) {
System.out.print((char) buf[i]);
}
remainingBytes -= readCnt;
b.reset();
}
} catch (IOException ioe) {
ioe.printStackTrace();
}
Note that this doesn't support multi-byte UTF-8 characters.
Using a RandomAccessFile you can easily read a file in chunks from the end to the beginning, and reverse each of the chunks.
Here's a simple example:
import java.io.FileWriter;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.stream.IntStream;
class Test {
private static final int BUF_SIZE = 10;
private static final int FILE_LINE_COUNT = 105;
public static void main(String[] args) throws Exception {
// create a large file
try (FileWriter fw = new FileWriter("largeFile.txt")) {
IntStream.range(1, FILE_LINE_COUNT).mapToObj(Integer::toString).forEach(s -> {
try {
fw.write(s + "\n");
} catch (IOException e) {
throw new RuntimeException(e);
}
});
}
// reverse the file
try (RandomAccessFile raf = new RandomAccessFile("largeFile.txt", "r")) {
long size = raf.length();
byte[] buf = new byte[BUF_SIZE];
for (long i = size - BUF_SIZE; i > -BUF_SIZE; i -= BUF_SIZE) {
long offset = Math.max(0, i);
long readSize = Math.min(i + BUF_SIZE, BUF_SIZE);
raf.seek(offset);
raf.read(buf, 0, (int) readSize);
for (int j = (int) readSize - 1; j >= 0; j--) {
System.out.print((char) buf[j]);
}
}
}
}
}
This uses a very small file and very small chunks so that you can test it easily. Increase those constants to see it work on a larger scale.
The input file contains newlines to make it easy to read the output, but the reversal doesn't depend on the file "having lines".

Java FileChannel Vs BufferedReader - Spring Batch - Reader

We process huge files (sometimes 50 GB per file). The application reads one such file and, based on the business logic, writes multiple output files (4-6).
The records in the file are of variable length, and the fields in a record are delimiter-separated.
We went by the understanding that reading a file using a FileChannel with a ByteBuffer is always better than using BufferedReader.readLine() and then splitting on the delimiter.
Buffer sizes tried: 10240 (10 KB) and larger.
Commit intervals tried: 5000, 10000, etc.
Below is how we used the FileChannel to read:
Read byte by byte. Check whether the byte read is a newline character (10), which means end of record.
Check for delimiter bytes. Capture the bytes read into a byte array (initialized with a maximum field size of 350 bytes) until the delimiter bytes are encountered.
Convert the bytes read so far to a String using UTF-8 encoding - new String(byteArr, 0, index, "UTF-8") to be specific, where index is the number of bytes read up to the delimiter.
Reading the file this way with FileChannel took 57 minutes to process it.
We wanted to decrease this time, so we tried BufferedReader.readLine() followed by a split on the delimiter, to see how it fares.
And shockingly, the same file completed processing in only 7 minutes.
What's the catch here? Why is FileChannel taking more time than a BufferedReader followed by a String split?
I was always under the assumption that the readLine-and-split combination would carry a big performance cost.
Can anyone shed light on whether I was using FileChannel the wrong way?
Thanks in advance. I hope I have summarized the issue properly.
Below is sample code:
while (inputByteBuffer.hasRemaining() && (b = inputByteBuffer.get()) != 0){
boolean endOfField = false;
if (b == 10){
break;
}
else{
if (b == 94){//^
if (!inputByteBuffer.hasRemaining()){
inputByteBuffer.clear();
noOfBytes = inputFileChannel.read(inputByteBuffer);
inputByteBuffer.flip();
}
if (inputByteBuffer.hasRemaining()){
byte b2 = inputByteBuffer.get();
if (b2 == 124){//|
if (!inputByteBuffer.hasRemaining()){
inputByteBuffer.clear();
noOfBytes = inputFileChannel.read(inputByteBuffer);
inputByteBuffer.flip();
}
if (inputByteBuffer.hasRemaining()){
byte b3 = inputByteBuffer.get();
if (b3 == 94){//^
String field = new String(fieldBytes, 0, index, encoding);
if(fieldIndex == -1){
fields = new String[sizeFromAConfiguration];
}else{
fields[fieldIndex] = field;
}
fieldBytes = new byte[maxFieldSize];
endOfField = true;
fieldIndex++;
}
else{
fieldBytes = addFieldBytes(fieldBytes, b, index);
index++;
fieldBytes = addFieldBytes(fieldBytes, b2, index);
index++;
fieldBytes = addFieldBytes(fieldBytes, b3, index);
}
}
else{
endOfFile = true;
//fields.add(new String(fieldBytes, 0, index, encoding));
fields[fieldIndex] = new String(fieldBytes, 0, index, encoding);
fieldBytes = new byte[maxFieldSize];
endOfField = true;
}
}else{
fieldBytes = addFieldBytes(fieldBytes, b, index);
index++;
fieldBytes = addFieldBytes(fieldBytes, b2, index);
}
}else{
endOfFile = true;
fieldBytes = addFieldBytes(fieldBytes, b, index);
}
}
else{
fieldBytes = addFieldBytes(fieldBytes, b, index);
}
}
if (!inputByteBuffer.hasRemaining()){
inputByteBuffer.clear();
noOfBytes = inputFileChannel.read(inputByteBuffer);
inputByteBuffer.flip();
}
if (endOfField){
index = 0;
}
else{
index++;
}
}
You're causing a lot of overhead with the constant hasRemaining()/read() checks as well as the constant get() calls. It would probably be better to get() the entire buffer into an array and process that directly, only calling read() when you get to the end.
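Something along these lines, as a sketch only - it reuses the inputByteBuffer and inputFileChannel names from the question and leaves out the handling of a field that straddles two reads:
// Sketch: drain the buffer in bulk instead of per-byte hasRemaining()/get() calls.
byte[] chunk = new byte[inputByteBuffer.capacity()];
while (inputFileChannel.read(inputByteBuffer) != -1) {
    inputByteBuffer.flip();
    int n = inputByteBuffer.remaining();
    inputByteBuffer.get(chunk, 0, n);        // one bulk copy out of the buffer
    for (int i = 0; i < n; i++) {
        byte b = chunk[i];
        // ...apply the same newline/delimiter logic as before, but on a plain
        // array, so there is no method call per byte
    }
    inputByteBuffer.clear();
}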
And to answer a question in comments, you should not allocate a new ByteBuffer per read. This is expensive. Keep using the same one. And NB do not use a DirectByteBuffer for this application. It is not appropriate: it's only appropriate when you want the data to stay south of the JVM/JNI boundary, e.g. when merely copying between channels.
But I think I would throw this away, or rather rewrite it, using BufferedReader.read(), rather than readLine() followed by string splits, and using much the same logic as you have here, except of course that you don't need to keep calling hasRemaining() and filling the buffer, which BufferedReader will do automatically for you.
You have to take care to store the result of read() into an int, and to check it for -1 after every read().
It isn't clear to me that you should be using a Reader at all actually, unless you know you have multibyte text. Possibly a simple BufferedInputStream would be more appropriate.
While one cannot tell with certainty how a particular piece of code will behave, I would imagine the best way is to profile it, just as you did. The FileChannel, while perceived to be faster, is actually not helping in your case. But this may not be because of reading from the file, but because of the actual processing you do with the content you read.
One article I would like to point out while dealing with files is
https://www.redgreencode.com/why-is-java-io-slow/
Also the corresponding Github codebase
Java IO benchmark
I would like to point out this code, which combines both worlds:
fos = new FileOutputStream(outputFile);
outFileChannel = fos.getChannel();
bufferedWriter = new BufferedWriter(Channels.newWriter(outFileChannel, "UTF-8"));
Since your case is reading, I would use:
File inputFile = new File("C:\\input.txt");
FileInputStream fis = new FileInputStream(inputFile);
FileChannel inputChannel = fis.getChannel();
BufferedReader bufferedReader = new BufferedReader(Channels.newReader(inputChannel,"UTF-8"));
Also, I would tweak the chunk size; with Spring Batch it is always trial and error to find the sweet spot.
On a completely unrelated note, the reason you could not use BufferedReader is the doubling of characters, and I assume this happens more commonly with EBCDIC characters. I would simply run a loop like this to identify the troublemakers and eliminate them at the source:
import java.io.UnsupportedEncodingException;
public class EbcdicConvertor {
public static void main(String[] args) throws UnsupportedEncodingException {
int index = 0;
for (int i = -127; i < 128; i++) {
byte[] b = new byte[1];
b[0] = (byte) i;
String cp037 = new String(b, "CP037");
if (cp037.getBytes().length == 2) {
index++;
System.out.println(i + "::" + cp037);
}
}
System.out.println(index);
}
}
The above answer was written without testing my actual hypothesis. Here is an actual program to measure the time. The results speak for themselves on a 200 MB file:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;
public class ReadComplexDelimitedFile {
private static long total = 0;
private static final Pattern DELIMITER_PATTERN = Pattern.compile("\\^\\|\\^");
private void readFileUsingScanner() {
String s;
try (Scanner stdin = new Scanner(new File(this.getClass().getResource("input.txt").getPath()))) {
while (stdin.hasNextLine()) {
s = stdin.nextLine();
String[] fields = DELIMITER_PATTERN.split(s, 0);
total = total + fields.length;
}
} catch (Exception e) {
System.err.println("Error");
}
}
private void readFileUsingCustomBufferedReader() {
try (BufferedReader stdin = new BufferedReader(new FileReader(new File(this.getClass().getResource("input.txt").getPath())))) {
String s;
while ((s = stdin.readLine()) != null) {
String[] fields = DELIMITER_PATTERN.split(s, 0);
total += fields.length;
}
} catch (Exception e) {
System.err.println("Error");
}
}
private void readFileUsingBufferedReader() {
try (java.io.BufferedReader stdin = new java.io.BufferedReader(new FileReader(new File(this.getClass().getResource("input.txt").getPath())))) {
String s;
while ((s = stdin.readLine()) != null) {
String[] fields = DELIMITER_PATTERN.split(s, 0);
total += fields.length;
}
} catch (Exception e) {
System.err.println("Error");
}
}
private void readFileUsingBufferedReaderFileChannel() {
try (FileInputStream fis = new FileInputStream(this.getClass().getResource("input.txt").getPath())) {
try (FileChannel inputChannel = fis.getChannel()) {
try (BufferedReader stdin = new BufferedReader(Channels.newReader(inputChannel, "UTF-8"))) {
String s;
while ((s = stdin.readLine()) != null) {
String[] fields = DELIMITER_PATTERN.split(s, 0);
total = total + fields.length;
}
}
} catch (Exception e) {
System.err.println("Error");
}
} catch (Exception e) {
System.err.println("Error");
}
}
private void readFileUsingBufferedReaderByteFileChannel() {
try (FileInputStream fis = new FileInputStream(this.getClass().getResource("input.txt").getPath())) {
try (FileChannel inputChannel = fis.getChannel()) {
try (BufferedReader stdin = new BufferedReader(Channels.newReader(inputChannel, "UTF-8"))) {
int b;
StringBuilder sb = new StringBuilder();
while ((b = stdin.read()) != -1) {
if (b == 10) {
total = total + DELIMITER_PATTERN.split(sb, 0).length;
sb = new StringBuilder();
} else {
sb.append((char) b);
}
}
}
} catch (Exception e) {
e.printStackTrace();
}
} catch (Exception e) {
System.err.println("Error");
}
}
private void readFileUsingFileChannelStream() {
try (RandomAccessFile fis = new RandomAccessFile(new File(this.getClass().getResource("input.txt").getPath()), "r")) {
try (FileChannel inputChannel = fis.getChannel()) {
ByteBuffer byteBuffer = ByteBuffer.allocate(8192);
ByteBuffer recordBuffer = ByteBuffer.allocate(250);
int recordLength = 0;
while ((inputChannel.read(byteBuffer)) != -1) {
byte b;
byteBuffer.flip();
while (byteBuffer.hasRemaining() && (b = byteBuffer.get()) != -1) {
if (b == 10) {
recordBuffer.flip();
total = total + splitIntoFields(recordBuffer, recordLength);
recordBuffer.clear();
recordLength = 0;
} else {
++recordLength;
recordBuffer.put(b);
}
}
byteBuffer.clear();
}
}
} catch (Exception e) {
e.printStackTrace();
}
}
private int splitIntoFields(ByteBuffer recordBuffer, int recordLength) {
byte b;
String[] fields = new String[17];
int fieldCount = -1;
StringBuilder sb = new StringBuilder();
for (int i = 0; i < recordLength - 1; i++) {
b = recordBuffer.get(i);
if (b == 94 && recordBuffer.get(++i) == 124 && recordBuffer.get(++i) == 94) {
fields[++fieldCount] = sb.toString();
sb = new StringBuilder();
} else {
sb.append((char) b);
}
}
fields[++fieldCount] = sb.toString();
return fields.length;
}
public static void main(String args[]) {
// JVM warmup
for (int i = 0; i < 100000; i++) {
total += i;
}
// We know scanner is slow-Still warming up
ReadComplexDelimitedFile readComplexDelimitedFile = new ReadComplexDelimitedFile();
List<Long> longList = new ArrayList<>(50);
for (int i = 0; i < 50; i++) {
total = 0;
long startTime = System.nanoTime();
readComplexDelimitedFile.readFileUsingScanner();
long stopTime = System.nanoTime();
long timeDifference = stopTime - startTime;
longList.add(timeDifference);
}
System.out.println("Time taken for readFileUsingScanner");
longList.forEach(System.out::println);
// Actual performance test starts here
longList = new ArrayList<>(10);
for (int i = 0; i < 10; i++) {
total = 0;
long startTime = System.nanoTime();
readComplexDelimitedFile.readFileUsingBufferedReaderFileChannel();
long stopTime = System.nanoTime();
long timeDifference = stopTime - startTime;
longList.add(timeDifference);
}
System.out.println("Time taken for readFileUsingBufferedReaderFileChannel");
longList.forEach(System.out::println);
longList.clear();
for (int i = 0; i < 10; i++) {
total = 0;
long startTime = System.nanoTime();
readComplexDelimitedFile.readFileUsingBufferedReader();
long stopTime = System.nanoTime();
long timeDifference = stopTime - startTime;
longList.add(timeDifference);
}
System.out.println("Time taken for readFileUsingBufferedReader");
longList.forEach(System.out::println);
longList.clear();
for (int i = 0; i < 10; i++) {
total = 0;
long startTime = System.nanoTime();
readComplexDelimitedFile.readFileUsingCustomBufferedReader();
long stopTime = System.nanoTime();
long timeDifference = stopTime - startTime;
longList.add(timeDifference);
}
System.out.println("Time taken for readFileUsingCustomBufferedReader");
longList.forEach(System.out::println);
longList.clear();
for (int i = 0; i < 10; i++) {
total = 0;
long startTime = System.nanoTime();
readComplexDelimitedFile.readFileUsingBufferedReaderByteFileChannel();
long stopTime = System.nanoTime();
long timeDifference = stopTime - startTime;
longList.add(timeDifference);
}
System.out.println("Time taken for readFileUsingBufferedReaderByteFileChannel");
longList.forEach(System.out::println);
longList.clear();
for (int i = 0; i < 10; i++) {
total = 0;
long startTime = System.nanoTime();
readComplexDelimitedFile.readFileUsingFileChannelStream();
long stopTime = System.nanoTime();
long timeDifference = stopTime - startTime;
longList.add(timeDifference);
}
System.out.println("Time taken for readFileUsingFileChannelStream");
longList.forEach(System.out::println);
}
}
BufferedReader was written a very long time ago, and hence we can rewrite some parts relevant to this example. For instance, we don't care about \r, skipLF, skipCR, and that kind of thing.
We are only going to read the file (so no need for synchronized).
By extension there is no need for StringBuffer; StringBuilder can be used instead, and a performance improvement is seen immediately.
This is a dangerous hack: removing synchronized and replacing StringBuffer with StringBuilder. Do not use it without proper testing and without knowing what you are doing.
public String readLine() throws IOException {
StringBuilder s = null;
int startChar;
bufferLoop:
for (; ; ) {
if (nextChar >= nChars)
fill();
if (nextChar >= nChars) { /* EOF */
if (s != null && s.length() > 0)
return s.toString();
else
return null;
}
boolean eol = false;
char c = 0;
int i;
/* Skip a leftover '\n', if necessary */
charLoop:
for (i = nextChar; i < nChars; i++) {
c = cb[i];
if (c == '\n') {
eol = true;
break charLoop;
}
}
startChar = nextChar;
nextChar = i;
if (eol) {
String str;
if (s == null) {
str = new String(cb, startChar, i - startChar);
} else {
s.append(cb, startChar, i - startChar);
str = s.toString();
}
nextChar++;
return str;
}
if (s == null)
s = new StringBuilder(defaultExpectedLineLength);
s.append(cb, startChar, i - startChar);
}
}
Java 8, Intel i5, 12 GB RAM, Windows 10
Result:
Time taken for readFileUsingBufferedReaderFileChannel::
2581635057 1849820885 1763992972 1770510738 1746444157 1733491399
1740530125 1723907177 1724280512 1732445638
Time taken for readFileUsingBufferedReader
1851027073 1775304769 1803507033 1789979554 1786974538 1802675458
1789672780 1798036307 1789847714 1785302003
Time taken for readFileUsingCustomBufferedReader
1745220476 1721039975 1715383650 1728548462 1724746005 1718177466
1738026017 1748077438 1724608192 1736294175
Time taken for readFileUsingBufferedReaderByteFileChannel
2872857919 2480237636 2917488143 2913491126 2880117231 2904614745
2911756298 2878777496 2892169722 2888091211
Time taken for readFileUsingFileChannelStream
3039447073 2896156498 2538389366 2906287280 2887612064 2929288046
2895626578 2955326255 2897535059 2884476915
Process finished with exit code 0
I did try NIO with all possible options (provided in this post and to the best of my knowledge and research) and found that it came nowhere close to BufferedReader in terms of reading a text file.
Changing BufferedReader to use StringBuilder in place of StringBuffer, I don't see any significant improvement in performance (only a few seconds for some files, and some of them were better with StringBuffer itself).
Removing the synchronized block also didn't give much of an improvement, and it's not worth tweaking something that gives no benefit.
Below is the time taken (reading, processing, writing - the time for processing and writing is not significant, not even 20% of the total) for a file of around 50 GB:
NIO : 71.67 (Minutes)
IO (BufferedReader) : 10.84 (Minutes)
Thank you all for your time to reading and responding to this post and providing suggestions.
The main issue here was creating a new byte[] very rapidly (fieldBytes = new byte[maxFieldSize];).
Since a new array is created for every field, garbage collection kicks in very often, which triggers "stop the world" pauses to reclaim memory.
Also, the object creation itself can be expensive.
We could instead initialize the byte array once and then track indexes, converting each field to a String with an end index, as sketched below.
And anyway, BufferedReader is faster than FileChannel, at least for reading ASCII files, so to keep the code simple we continued using BufferedReader.
With BufferedReader, the development and testing effort is also reduced, since there is no tedious logic for finding delimiters and populating the object.
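Roughly the shape of that change, as a sketch rather than the actual production code (fieldBytes, maxFieldSize, fields, fieldIndex, encoding and b are the names from the question):
// Sketch: one reusable buffer per record; fields are cut out by index,
// so no byte[] is allocated per field and GC pressure drops.
byte[] fieldBytes = new byte[maxFieldSize];   // allocated once, reused for every field
int index = 0;                                // write position inside fieldBytes

// ...inside the per-byte loop, for an ordinary data byte:
fieldBytes[index++] = b;

// ...when the ^|^ delimiter (or end of record) is seen:
fields[++fieldIndex] = new String(fieldBytes, 0, index, encoding);
index = 0;                                    // rewind instead of reallocating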

Different results with different number of threads

I am trying to read a file in chunks and pass each chunk to a thread that counts how many times each byte occurs in the chunk. The trouble is that when I pass the whole file to only one thread I get a correct result, but passing it to multiple threads gives very strange results. Here's my code:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
public class Main{
public static void main(String[] args) throws InterruptedException, ExecutionException, IOException
{
// get number of threads to be run
Scanner in = new Scanner(System.in);
int numberOfThreads = in.nextInt();
// read file
File file = new File("testfile.txt");
long fileSize = file.length();
long chunkSize = fileSize / numberOfThreads;
FileInputStream input = new FileInputStream(file);
byte[] buffer = new byte[(int)chunkSize];
ExecutorService pool = Executors.newFixedThreadPool(numberOfThreads);
Set<Future<int[]>> set = new HashSet<Future<int[]>>();
while(input.available() > 0)
{
if(input.available() < chunkSize)
{
chunkSize = input.available();
}
input.read(buffer, 0, (int) chunkSize);
Callable<int[]> callable = new FrequenciesCounter(buffer);
Future<int[]> future = pool.submit(callable);
set.add(future);
}
// let's assume we will use extended ASCII characters only
int alphabet = 256;
// hold how many times each character is contained in the input file
int[] frequencies = new int[alphabet];
// sum the frequencies from each thread
for(Future<int[]> future: set)
{
for(int i = 0; i < alphabet; i++)
{
frequencies[i] += future.get()[i];
}
}
input.close();
for(int i = 0; i< frequencies.length; i++)
{
if(frequencies[i] > 0) System.out.println((char)i + " " + frequencies[i]);
}
}
}
// helper class for multithreaded frequency counting
class FrequenciesCounter implements Callable<int[]>
{
private int[] frequencies = new int[256];
private byte[] input;
public FrequenciesCounter(byte[] buffer)
{
input = buffer;
}
public int[] call()
{
for(int i = 0; i < input.length; i++)
{
frequencies[(int)input[i]]++;
}
return frequencies;
}
}
My testfile.txt is aaaaaaaaaaaaaabbbbcccccc.
With 1 thread the output is:
a 14
b 4
c 6
With 2 threads the output is:
a 4
b 8
c 12
With 3 threads the output is:
b 6
c 18
And there are other strange results that I cannot figure out. Could anybody help?
Every thread is using the same buffer, and one thread will be overwriting the buffer as another thread is trying to process it.
You need to make sure every thread has its own buffer that nobody else can modify.
Create byte[] array for every thread.
public static void main(String[] args) throws InterruptedException, ExecutionException, IOException {
// get number of threads to be run
Scanner in = new Scanner(System.in);
int numberOfThreads = in.nextInt();
// read file
File file = new File("testfile.txt");
long fileSize = file.length();
long chunkSize = fileSize / numberOfThreads;
FileInputStream input = new FileInputStream(file);
ExecutorService pool = Executors.newFixedThreadPool(numberOfThreads);
Set<Future<int[]>> set = new HashSet<Future<int[]>>();
while (input.available() > 0) {
//create buffer for every thread.
byte[] buffer = new byte[(int) chunkSize];
if (input.available() < chunkSize) {
chunkSize = input.available();
}
input.read(buffer, 0, (int) chunkSize);
Callable<int[]> callable = new FrequenciesCounter(buffer);
Future<int[]> future = pool.submit(callable);
set.add(future);
}
// let's assume we will use extended ASCII characters only
int alphabet = 256;
// hold how many times each character is contained in the input file
int[] frequencies = new int[alphabet];
// sum the frequencies from each thread
for (Future<int[]> future : set) {
for (int i = 0; i < alphabet; i++) {
frequencies[i] += future.get()[i];
}
}
input.close();
for (int i = 0; i < frequencies.length; i++) {
if (frequencies[i] > 0)
System.out.println((char) i + " " + frequencies[i]);
}
}
}

Fast parsing of strings of numbers in Java

I have found plenty of different suggestions on how to parse an ASCII file containing double precision numbers into an array of doubles in Java. What I currently use is roughly the following:
InputStream stream = new FileInputStream(fname);
BufferedReader breader = new BufferedReader(new InputStreamReader(stream));
Scanner scanner = new Scanner(breader);
double[] array = new double[size]; // size is known upfront
int idx = 0;
try {
    while (idx < size) {
        array[idx] = scanner.nextDouble();
        idx++;
    }
} catch (Exception e) { /* ... */ }
For an example file with 1 million numbers this code takes roughly 2 seconds. Similar code written in C, using fscanf, takes 0.1 second (!) Clearly I got it all wrong. I guess calling nextDouble() so many times is the wrong way to go because of the overhead, but I cannot figure out a better way.
I am no Java expert and hence I need a little help with this: can you tell me how to improve this code?
Edit: The corresponding C code follows.
fd = fopen(fname, "r+");
vals = calloc(sizeof(double), size);
do{
nel = fscanf(fd, "%lf", vals+idx);
idx++;
} while(nel!=-1);
(Summarizing some of the things that I already mentioned in the comments:)
You should be careful with manual benchmarks. The answer to the question How do I write a correct micro-benchmark in Java? points out some of the basic caveats. However, this case is not so prone to the classical pitfalls. In fact, the opposite might be the case: When the benchmark solely consists of reading a file, then you are most likely not benchmarking the code, but mainly the hard disc. This involves the usual side effects of caching.
However, there obviously is an overhead beyond the pure file IO.
You should be aware that the Scanner class is very powerful and convenient. But internally, it is a beast consisting of large regular expressions and hides a tremendous complexity from the user - a complexity that is not necessary at all when your intention is to only read double values!
There are solutions with less overhead.
Unfortunately, the simplest solution is only applicable when the numbers in the input are separated by line separators. Then, reading this file into an array could be written as
double result[] =
Files.lines(Paths.get(fileName))
.mapToDouble(Double::parseDouble)
.toArray();
and this could even be rather fast. When there are multiple numbers in one line (as you mentioned in the comment), then this could be extended:
double result[] =
Files.lines(Paths.get(fileName))
.flatMap(s -> Stream.of(s.split("\\s+")))
.mapToDouble(Double::parseDouble)
.toArray();
So regarding the general question of how to efficiently read a set of double values from a file, separated by whitespaces (but not necessarily separated by newlines), I wrote a small test.
This should not be considered as a real benchmark, and be taken with a grain of salt, but it at least tries to address some basic issues: It reads files with different sizes, multiple times, with different methods, so that for the later runs, the effects of hard disc caching should be the same for all methods:
Updated to generate sample data as described in the comment, and added the stream-based approach
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.StreamTokenizer;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.Random;
import java.util.Scanner;
import java.util.StringTokenizer;
import java.util.stream.Stream;
public class ReadingFileWithDoubles
{
private static final int MIN_SIZE = 256000;
private static final int MAX_SIZE = 2048000;
public static void main(String[] args) throws IOException
{
generateFiles();
long before = 0;
long after = 0;
double result[] = null;
for (int n=MIN_SIZE; n<=MAX_SIZE; n*=2)
{
String fileName = "doubles"+n+".txt";
for (int i=0; i<10; i++)
{
before = System.nanoTime();
result = readWithScanner(fileName, n);
after = System.nanoTime();
System.out.println(
"size = " + n +
", readWithScanner " +
(after - before) / 1e6 +
", result " + result);
before = System.nanoTime();
result = readWithStreamTokenizer(fileName, n);
after = System.nanoTime();
System.out.println(
"size = " + n +
", readWithStreamTokenizer " +
(after - before) / 1e6 +
", result " + result);
before = System.nanoTime();
result = readWithBufferAndStringTokenizer(fileName, n);
after = System.nanoTime();
System.out.println(
"size = " + n +
", readWithBufferAndStringTokenizer " +
(after - before) / 1e6 +
", result " + result);
before = System.nanoTime();
result = readWithStream(fileName, n);
after = System.nanoTime();
System.out.println(
"size = " + n +
", readWithStream " +
(after - before) / 1e6 +
", result " + result);
}
}
}
private static double[] readWithScanner(
String fileName, int size) throws IOException
{
try (
InputStream is = new FileInputStream(fileName);
InputStreamReader isr = new InputStreamReader(is);
BufferedReader br = new BufferedReader(isr);
Scanner scanner = new Scanner(br))
{
// Do this to avoid surprises on systems with a different locale!
scanner.useLocale(Locale.ENGLISH);
int idx = 0;
double array[] = new double[size];
while (idx < size)
{
array[idx] = scanner.nextDouble();
idx++;
}
return array;
}
}
private static double[] readWithStreamTokenizer(
String fileName, int size) throws IOException
{
try (
InputStream is = new FileInputStream(fileName);
InputStreamReader isr = new InputStreamReader(is);
BufferedReader br = new BufferedReader(isr))
{
StreamTokenizer st = new StreamTokenizer(br);
st.resetSyntax();
st.wordChars('0', '9');
st.wordChars('.', '.');
st.wordChars('-', '-');
st.wordChars('e', 'e');
st.wordChars('E', 'E');
double array[] = new double[size];
int index = 0;
boolean eof = false;
do
{
int token = st.nextToken();
switch (token)
{
case StreamTokenizer.TT_EOF:
eof = true;
break;
case StreamTokenizer.TT_WORD:
double d = Double.parseDouble(st.sval);
array[index++] = d;
break;
}
} while (!eof);
return array;
}
}
// This one is reading the whole file into memory, as a String,
// which may not be appropriate for large files
private static double[] readWithBufferAndStringTokenizer(
String fileName, int size) throws IOException
{
double array[] = new double[size];
try (
InputStream is = new FileInputStream(fileName);
InputStreamReader isr = new InputStreamReader(is);
BufferedReader br = new BufferedReader(isr))
{
StringBuilder sb = new StringBuilder();
char buffer[] = new char[1024];
while (true)
{
int n = br.read(buffer);
if (n == -1)
{
break;
}
sb.append(buffer, 0, n);
}
int index = 0;
StringTokenizer st = new StringTokenizer(sb.toString());
while (st.hasMoreTokens())
{
array[index++] = Double.parseDouble(st.nextToken());
}
return array;
}
}
private static double[] readWithStream(
String fileName, int size) throws IOException
{
double result[] =
Files.lines(Paths.get(fileName))
.flatMap(s -> Stream.of(s.split("\\s+")))
.mapToDouble(Double::parseDouble)
.toArray();
return result;
}
private static void generateFiles() throws IOException
{
for (int n=MIN_SIZE; n<=MAX_SIZE; n*=2)
{
String fileName = "doubles"+n+".txt";
if (!new File(fileName).exists())
{
System.out.println("Creating "+fileName);
writeDoubles(new FileOutputStream(fileName), n);
}
else
{
System.out.println("File "+fileName+" already exists");
}
}
}
private static void writeDoubles(OutputStream os, int n) throws IOException
{
OutputStreamWriter writer = new OutputStreamWriter(os);
Random random = new Random(0);
int numbersPerLine = random.nextInt(4) + 1;
for (int i=0; i<n; i++)
{
writer.write(String.valueOf(random.nextDouble()));
numbersPerLine--;
if (numbersPerLine == 0)
{
writer.write("\n");
numbersPerLine = random.nextInt(4) + 1;
}
else
{
writer.write(" ");
}
}
writer.close();
}
}
It compares 4 methods:
Reading with a Scanner, as in your original code snippet
Reading with a StreamTokenizer
Reading the whole file into a String, and dissecting it with a StringTokenizer
Reading the file as a Stream of lines, which are then flat-mapped to a Stream of tokens, which are then mapped to a DoubleStream
Reading the file as one large String may not be appropriate in all cases: When the files become (much) larger, then keeping the whole file in memory as a String may not be a viable solution.
A test run (on a rather old PC, with a slow hard disc drive (no solid state)) showed roughly these results:
...
size = 1024000, readWithScanner 9932.940919, result [D@1c7353a
size = 1024000, readWithStreamTokenizer 1187.051427, result [D@1a9515
size = 1024000, readWithBufferAndStringTokenizer 1172.235019, result [D@f49f1c
size = 1024000, readWithStream 2197.785473, result [D@1469ea2
...
Obviously, the scanner imposes a considerable overhead that may be avoided when reading more directly from the stream.
This may not be the final answer, as there may be more efficient and/or more elegant solutions (and I'm looking forward to see them!), but maybe it is helpful at least.
EDIT
A small remark: There is a certain conceptual difference between the approaches in general. Roughly speaking, the difference lies in who determines the number of elements that are read. In pseudocode, this difference is
double array[] = new double[size];
for (int i=0; i<size; i++)
{
array[i] = readDoubleFromInput();
}
versus
double array[] = new double[size];
int index = 0;
while (thereAreStillNumbersInTheInput())
{
double d = readDoubleFromInput();
array[index++] = d;
}
Your original approach with the scanner was written like the first one, while the solutions that I proposed are more similar to the second. But this should not make a large difference here, assuming that the size is indeed the real size, and potential errors (like too few or too many numbers in the input) don't appear or are handled in some other way.

How to read large text files efficiently in Java

Here, I am reading an 18 MB file and storing it in a two-dimensional array. But this program takes almost 15 minutes to run. Is there any way to optimize its running time? The file contains only binary values. Thanks in advance.
public class test
{
public static void main(String[] args) throws FileNotFoundException, IOException
{
BufferedReader br;
FileReader fr=null;
int m = 2160;
int n = 4320;
int[][] lof = new int[n][m];
String filename = "D:/New Folder/ETOPOCHAR";
try {
Scanner input = new Scanner(new File("D:/New Folder/ETOPOCHAR"));
double range_km=1.0;
double alonn=-57.07; //180 to 180
double alat=38.53;
while (input.hasNextLine()) {
for (int i = 0; i < m; i++) {
for (int j = 0; j < n; j++) {
try
{
lof[j][i] = input.nextInt();
System.out.println("value[" + j + "][" + i + "] = "+ lof[j][i]);
}
catch (java.util.NoSuchElementException e) {
// e.printStackTrace();
}
}
} //print the input matrix
}
} catch (FileNotFoundException e) {
e.printStackTrace();
}
}
}
I have also tried with a byte array, but I cannot save it into a two-dimensional array...
public class FileToArrayOfBytes
{
public static void main( String[] args )
{
FileInputStream fileInputStream=null;
File file = new File("name of file");
byte[] bFile = new byte[(int) file.length()];
try {
//convert file into array of bytes
fileInputStream = new FileInputStream(file);
fileInputStream.read(bFile);
fileInputStream.close();
for (int i = 0; i < bFile.length; i++) {
System.out.print((char)bFile[i]);
}
System.out.println("Done");
}catch(Exception e){
e.printStackTrace();
}
}
}
You can read the file into a byte array first, then deserialize those bytes. Start with a 2048-byte input buffer, then experiment by increasing or decreasing its size, but keep the experimental buffer sizes a power of two (512, 1024, 2048, etc.).
As far as I remember, there is a good chance that the best performance is achieved with a buffer of 2048 bytes, but this is OS-dependent and should be verified.
Code sample (here you can try different values of the BUFFER_SIZE variable; in my case I read a 7.5 MB test file in less than one second):
public static void main(String... args) throws IOException {
File f = new File(args[0]);
byte[] buffer = new byte[BUFFER_SIZE];
ByteBuffer result = ByteBuffer.allocateDirect((int) f.length());
try (FileInputStream fos = new FileInputStream(f)) {
int bytesRead;
int totalBytesRead = 0;
while ((bytesRead = fos.read(buffer, 0, BUFFER_SIZE)) != -1) {
result.put(buffer, 0, bytesRead);
totalBytesRead += bytesRead;
}
// debug info
System.out.printf("Read %d bytes\n", totalBytesRead);
// Here you can do whatever you want with the result, including creation of a 2D array...
int pos = result.position();
result.rewind();
for (int i = 0; i < pos / 4; i++) {
System.out.println(result.getInt());
}
}
}
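If the file actually holds whitespace-separated ASCII integers, as the Scanner version assumes, the bytes read above can be parsed straight into the two-dimensional array without a Scanner at all. This is only a sketch; rows and cols stand in for the question's array dimensions:
// Sketch only: parse ASCII integers out of the already-read buffer into a 2D array.
static int[][] toMatrix(ByteBuffer result, int rows, int cols) {
    int[][] matrix = new int[rows][cols];
    result.rewind();
    int r = 0, c = 0, value = 0;
    boolean inNumber = false, negative = false;
    while (result.hasRemaining() && r < rows) {
        byte b = result.get();
        if (b >= '0' && b <= '9') {
            value = value * 10 + (b - '0');
            inNumber = true;
        } else if (b == '-') {
            negative = true;
        } else if (inNumber) {                  // any separator ends the current number
            matrix[r][c] = negative ? -value : value;
            value = 0; inNumber = false; negative = false;
            if (++c == cols) { c = 0; r++; }
        }
    }
    if (inNumber && r < rows) matrix[r][c] = negative ? -value : value;
    return matrix;
}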
Take your time and read the docs for the java.io and java.nio packages, as well as the Scanner class, to improve your understanding.
