Fast parsing of strings of numbers in Java

I have found plenty of different suggestions on how to parse an ASCII file containing double precision numbers into an array of doubles in Java. What I currently use is roughly the following:
FileInputStream stream = new FileInputStream(fname);
BufferedReader breader = new BufferedReader(new InputStreamReader(stream));
Scanner scanner = new Scanner(breader);
double[] array = new double[size]; // size is known upfront
int idx = 0;
try {
    while (idx < size) {
        array[idx] = scanner.nextDouble();
        idx++;
    }
}
catch (Exception e) { ... }
For an example file with 1 million numbers this code takes roughly 2 seconds. Similar code written in C, using fscanf, takes 0.1 second (!) Clearly I got it all wrong. I guess calling nextDouble() so many times is the wrong way to go because of the overhead, but I cannot figure out a better way.
I am no Java expert and hence I need a little help with this: can you tell me how to improve this code?
Edit: the corresponding C code follows:
fd = fopen(fname, "r+");
vals = calloc(sizeof(double), size);
do{
nel = fscanf(fd, "%lf", vals+idx);
idx++;
} while(nel!=-1);

(Summarizing some of the things that I already mentioned in the comments:)
You should be careful with manual benchmarks. The answer to the question How do I write a correct micro-benchmark in Java? points out some of the basic caveats. However, this case is not so prone to the classical pitfalls. In fact, the opposite might be the case: When the benchmark solely consists of reading a file, then you are most likely not benchmarking the code, but mainly the hard disc. This involves the usual side effects of caching.
However, there obviously is an overhead beyond the pure file IO.
You should be aware that the Scanner class is very powerful and convenient. But internally, it is a beast consisting of large regular expressions and hides a tremendous complexity from the user - a complexity that is not necessary at all when your intention is to only read double values!
There are solutions with less overhead.
Unfortunately, the simplest solution is only applicable when the numbers in the input are separated by line separators. Then, reading this file into an array could be written as
double result[] =
Files.lines(Paths.get(fileName))
.mapToDouble(Double::parseDouble)
.toArray();
and this could even be rather fast. When there are multiple numbers in one line (as you mentioned in the comment), then this could be extended:
double result[] =
Files.lines(Paths.get(fileName))
.flatMap(s -> Stream.of(s.split("\\s+")))
.mapToDouble(Double::parseDouble)
.toArray();
So regarding the general question of how to efficiently read a set of double values from a file, separated by whitespaces (but not necessarily separated by newlines), I wrote a small test.
This should not be considered as a real benchmark, and be taken with a grain of salt, but it at least tries to address some basic issues: It reads files with different sizes, multiple times, with different methods, so that for the later runs, the effects of hard disc caching should be the same for all methods:
Updated to generate sample data as described in the comment, and added the stream-based approach
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.StreamTokenizer;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.Random;
import java.util.Scanner;
import java.util.StringTokenizer;
import java.util.stream.Stream;
public class ReadingFileWithDoubles
{
private static final int MIN_SIZE = 256000;
private static final int MAX_SIZE = 2048000;
public static void main(String[] args) throws IOException
{
generateFiles();
long before = 0;
long after = 0;
double result[] = null;
for (int n=MIN_SIZE; n<=MAX_SIZE; n*=2)
{
String fileName = "doubles"+n+".txt";
for (int i=0; i<10; i++)
{
before = System.nanoTime();
result = readWithScanner(fileName, n);
after = System.nanoTime();
System.out.println(
"size = " + n +
", readWithScanner " +
(after - before) / 1e6 +
", result " + result);
before = System.nanoTime();
result = readWithStreamTokenizer(fileName, n);
after = System.nanoTime();
System.out.println(
"size = " + n +
", readWithStreamTokenizer " +
(after - before) / 1e6 +
", result " + result);
before = System.nanoTime();
result = readWithBufferAndStringTokenizer(fileName, n);
after = System.nanoTime();
System.out.println(
"size = " + n +
", readWithBufferAndStringTokenizer " +
(after - before) / 1e6 +
", result " + result);
before = System.nanoTime();
result = readWithStream(fileName, n);
after = System.nanoTime();
System.out.println(
"size = " + n +
", readWithStream " +
(after - before) / 1e6 +
", result " + result);
}
}
}
private static double[] readWithScanner(
String fileName, int size) throws IOException
{
try (
InputStream is = new FileInputStream(fileName);
InputStreamReader isr = new InputStreamReader(is);
BufferedReader br = new BufferedReader(isr);
Scanner scanner = new Scanner(br))
{
// Do this to avoid surprises on systems with a different locale!
scanner.useLocale(Locale.ENGLISH);
int idx = 0;
double array[] = new double[size];
while (idx < size)
{
array[idx] = scanner.nextDouble();
idx++;
}
return array;
}
}
private static double[] readWithStreamTokenizer(
String fileName, int size) throws IOException
{
try (
InputStream is = new FileInputStream(fileName);
InputStreamReader isr = new InputStreamReader(is);
BufferedReader br = new BufferedReader(isr))
{
StreamTokenizer st = new StreamTokenizer(br);
st.resetSyntax();
st.wordChars('0', '9');
st.wordChars('.', '.');
st.wordChars('-', '-');
st.wordChars('e', 'e');
st.wordChars('E', 'E');
double array[] = new double[size];
int index = 0;
boolean eof = false;
do
{
int token = st.nextToken();
switch (token)
{
case StreamTokenizer.TT_EOF:
eof = true;
break;
case StreamTokenizer.TT_WORD:
double d = Double.parseDouble(st.sval);
array[index++] = d;
break;
}
} while (!eof);
return array;
}
}
// This one is reading the whole file into memory, as a String,
// which may not be appropriate for large files
private static double[] readWithBufferAndStringTokenizer(
String fileName, int size) throws IOException
{
double array[] = new double[size];
try (
InputStream is = new FileInputStream(fileName);
InputStreamReader isr = new InputStreamReader(is);
BufferedReader br = new BufferedReader(isr))
{
StringBuilder sb = new StringBuilder();
char buffer[] = new char[1024];
while (true)
{
int n = br.read(buffer);
if (n == -1)
{
break;
}
sb.append(buffer, 0, n);
}
int index = 0;
StringTokenizer st = new StringTokenizer(sb.toString());
while (st.hasMoreTokens())
{
array[index++] = Double.parseDouble(st.nextToken());
}
return array;
}
}
private static double[] readWithStream(
String fileName, int size) throws IOException
{
double result[] =
Files.lines(Paths.get(fileName))
.flatMap(s -> Stream.of(s.split("\\s+")))
.mapToDouble(Double::parseDouble)
.toArray();
return result;
}
private static void generateFiles() throws IOException
{
for (int n=MIN_SIZE; n<=MAX_SIZE; n*=2)
{
String fileName = "doubles"+n+".txt";
if (!new File(fileName).exists())
{
System.out.println("Creating "+fileName);
writeDoubles(new FileOutputStream(fileName), n);
}
else
{
System.out.println("File "+fileName+" already exists");
}
}
}
private static void writeDoubles(OutputStream os, int n) throws IOException
{
OutputStreamWriter writer = new OutputStreamWriter(os);
Random random = new Random(0);
int numbersPerLine = random.nextInt(4) + 1;
for (int i=0; i<n; i++)
{
writer.write(String.valueOf(random.nextDouble()));
numbersPerLine--;
if (numbersPerLine == 0)
{
writer.write("\n");
numbersPerLine = random.nextInt(4) + 1;
}
else
{
writer.write(" ");
}
}
writer.close();
}
}
It compares 4 methods:
Reading with a Scanner, as in your original code snippet
Reading with a StreamTokenizer
Reading the whole file into a String, and dissecting it with a StringTokenizer
Reading the file as a Stream of lines, which are then flat-mapped to a Stream of tokens, which are then mapped to a DoubleStream
Reading the file as one large String may not be appropriate in all cases: When the files become (much) larger, then keeping the whole file in memory as a String may not be a viable solution.
A test run (on a rather old PC, with a slow hard disc drive (no solid state)) showed roughly these results:
...
size = 1024000, readWithScanner 9932.940919, result [D@1c7353a
size = 1024000, readWithStreamTokenizer 1187.051427, result [D@1a9515
size = 1024000, readWithBufferAndStringTokenizer 1172.235019, result [D@f49f1c
size = 1024000, readWithStream 2197.785473, result [D@1469ea2
...
Obviously, the scanner imposes a considerable overhead that may be avoided when reading more directly from the stream.
This may not be the final answer, as there may be more efficient and/or more elegant solutions (and I'm looking forward to see them!), but maybe it is helpful at least.
EDIT
A small remark: There is a certain conceptual difference between the approaches in general. Roughly speaking, the difference lies in who determines the number of elements that are read. In pseudocode, this difference is
double array[] = new double[size];
for (int i=0; i<size; i++)
{
array[i] = readDoubleFromInput();
}
versus
double array[] = new double[size];
int index = 0;
while (thereAreStillNumbersInTheInput())
{
double d = readDoubleFromInput();
array[index++] = d;
}
Your original approach with the scanner was written like the first one, while the solutions that I proposed are more similar to the second. But this should not make a large difference here, assuming that the size is indeed the real size, and potential errors (like too few or too many numbers in the input) don't appear or are handled in some other way.

Related

Take values from a text file and put them in an array

For now in my program I am using hard-coded values, but I want the user to be able to use any text file and get the same result.
import java.io.IOException;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.File;
public class a1_12177903
{
public static void main(String [] args) throws IOException
{
if (args.length == 0)
{
System.out.println("File not found");
}
else
{
File file = new File(args[0]);
FileReader fr = new FileReader(file);
BufferedReader br = new BufferedReader(fr);
String line = "";
while (br.ready())
{
line += br.readLine();
}
String[] work = line.split(",");
double[] doubleArr = new double[work.length];
for (int i =0; i < doubleArr.length; i++)
{
doubleArr[i] = Double.parseDouble(work[i]);
}
double maxStartIndex=0;
double maxEndIndex=0;
double maxSum = 0;
double total = 0;
double maxStartIndexUntilNow = 0;
for (int currentIndex = 0; currentIndex < doubleArr.length; currentIndex++)
{
double eachArrayItem = doubleArr[currentIndex];
total += eachArrayItem;
if(total > maxSum)
{
maxSum = total;
maxStartIndex = maxStartIndexUntilNow;
maxEndIndex = currentIndex;
}
if (total < 0)
{
maxStartIndexUntilNow = currentIndex;
total = 0;
}
}
System.out.println("Max sum : "+ maxSum);
System.out.println("Max start index : "+ maxStartIndex);
System.out.println("Max end index : " +maxEndIndex);
}
}
}
I've fixed it so it takes the name of the text file from the command line. If anyone has any ways to improve this, I'll happily accept any improvements.
You can do this with Java 8 streams, assuming each entry has its own line:
double[] doubleArr = Files.lines(pathToFile)
.mapToDouble(Double::valueOf)
.toArray();
If you were using this on production systems (rather than as an exercise), it would be worthwhile to create the Stream inside a try-with-resources block. This makes sure your input file is closed properly.
try (Stream<String> lines = Files.lines(path)) {
    doubleArr = lines.mapToDouble(Double::valueOf)
                     .toArray();
}
If you have a comma-separated list, you will need to split each line first and use a flatMap:
double[] doubleArr = Files.lines(pathToFile)
    .flatMap(line -> Stream.of(line.split(",")))
    .mapToDouble(Double::valueOf)
    .toArray();
public static void main(String[] args) throws IOException {
String fileName = "";
File inputFile = new File(fileName);
BufferedReader br = new BufferedReader(new FileReader(inputFile));
// if input is in single line
StringTokenizer str = new StringTokenizer(br.readLine());
int tokenCount = str.countTokens(); // capture once: countTokens() shrinks as tokens are consumed
double[] doubleArr = new double[tokenCount];
for (int i = 0; i < tokenCount; i++) {
    doubleArr[i] = Double.parseDouble(str.nextToken());
}
// if multiple lines in input file for a single case
String line = "";
ArrayList<Double> arryList = new ArrayList<>();
while ((line = br.readLine()) != null) {
// delimiter of your choice
for (String x : line.split(" ")) {
arryList.add(Double.parseDouble(x));
}
}
// convert arraylist to array or maybe process arrayList
}
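The last comment above leaves the conversion open; a minimal way to turn the ArrayList<Double> (named arryList as in the snippet) into a primitive double[] on Java 8 would be:
double[] values = arryList.stream()
    .mapToDouble(Double::doubleValue)
    .toArray();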
This link may help: How to use BufferedReader. That will give you a String containing the data.
Then you have several ways to parse that string into an array.
Way 1: use JSONArray to parse it. For further information, search for JSON.
Way 2: use split() to break the string into an array. See below.
Code for way 2:
String line="10,20,50";//in fact you get this from file input.
String[] raw=line.split(",");
String[] arr=new String[raw.length];
for(int i=0;i<raw.length;++i)arr[i]=raw[i];
//now arr is what you want
Use streams if you are on JDK 8. Also take care of design principles/patterns: a strategy/template design pattern could be applied here (a sketch follows below). I know nobody here would ask you to focus on design guidelines. And please mind naming conventions as well; "File" is not a good class name.
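To make the strategy remark a bit more concrete, here is a minimal sketch (all names are made up for illustration): the reading loop stays the same while the parsing step can be swapped out.
interface LineParser {
    double[] parse(String line);
}

// One possible strategy: comma-separated values on each line
class CommaLineParser implements LineParser {
    public double[] parse(String line) {
        String[] tokens = line.split(",");
        double[] result = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            result[i] = Double.parseDouble(tokens[i].trim());
        }
        return result;
    }
}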

Reading N lines from a file in java?

I am not quite sure how to explain my question, but I will try my best. Say, for example, I have a file containing 100 numbers; is it possible to read only lines 25-50 of this file?
To read the first N lines from the beginning, I would do something like this:
ArrayList<Double> array = new ArrayList<Double>();
Scanner input = new Scanner(new File("numbers.txt"));
int counter = 0;
while(input.hasNextLine() && counter < 10)
{
array.add(Double.parseDouble(input.nextLine()));
counter++;
}
But I am not quite sure how to go about starting to read from a given line, e.g. lines 25-50, 25-75, or 75-100, etc.
Any help is much appreciated and please let me know if my question is not clear.
edit:
Some data in the file:
1.45347,1.1545,1.2405
1.467,1.4554,1.2233
1.4728,1.3299,1.1532
1.131,1.5139,1.0044
1.4614,1.7373,1.6235
1.654,1.5544,1.61147
byte[] inputBytes = "line 1\nline 2\nline 3\ntok 1 tok 2".getBytes();
Reader r = new InputStreamReader(new ByteArrayInputStream(inputBytes));
BufferedReader br = new BufferedReader(r);
Scanner s = new Scanner(br);
System.out.println("First line: " + br.readLine());
System.out.println("Second line: " + br.readLine());
System.out.println("Third line: " + br.readLine());
System.out.println("Remaining tokens:");
while (s.hasNext())
System.out.println(s.next());
and add a while loop like Astra suggested
Using Java 8 you have an easy solution. Note that the below code does not make any bounds checking of any kind (this is left as an exercise):
private static final Pattern COMMA = Pattern.compile(",");
public static List<Double> readNumbers(final String file,
final int startLine, final int endLine)
throws IOException
{
final long skip = (long) (startLine - 1);
final long limit = (long) (endLine - startLine);
final Path path = Paths.get(file);
try (
final Stream<String> stream = Files.lines(path, StandardCharsets.UTF_8);
) {
return stream.skip(skip).limit(limit)
.flatMap(COMMA::splitAsStream)
.map(Double::valueOf)
.collect(Collectors.toList());
}
}
It also appears that the problem is unclear; the code above reads all doubles in a given line range. If what you want is to read all doubles from a given start "index" to a given end "index", all you have to do in the code above is change the placement of the .skip().limit() to after the .map().
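A rough sketch of that rearranged pipeline, keeping the same helper names and off-by-one conventions as the method above (startIndex and endIndex are hypothetical parameters counting doubles rather than lines):
return stream
    .flatMap(COMMA::splitAsStream)
    .map(Double::valueOf)
    .skip(startIndex - 1)
    .limit(endIndex - startIndex)
    .collect(Collectors.toList());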
Assuming you have multiple numbers (unknown number of numbers) on each line:
int start = 25;
int finish = 50;
String delimit = ",";
List<Double> array = new ArrayList<Double>(finish - start);
Scanner input = new Scanner(new File("numbers.txt"));
int counter = 1;
while(input.hasNextLine() && counter <= finish)
{
String line = input.nextLine();
String[] splits = line.split(delimit);
for (int i=0; i<splits.length; i++){
if (counter >= start && counter <=finish){
array.add(Double.parseDouble(splits[i]));
}
counter++;
}
}

Splitting a .gz file into specified file sizes in Java

This is my first posting, so I am not sure how apt my description of the issue is.
Below is a program I have written to split a .gz file into files based on the size of each file, the user wants. The parent .gz file is getting split, but not into the size as specified in the code.
For example, in the main I have said I want the parent file to be split into files of size 1 MB. But on executing the code, its getting split into n number of files of different sizes. Can someone help me pin point where I am going wrong? Any help would be great as I have run out of ideas..
package com.bitsighttech.collection.packaging;
import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import org.apache.log4j.Logger;
public class FileSplitter
{
private static Logger logger = Logger.getLogger(FileSplitter.class);
private static final long KB = 1024;
private static final long MB = KB * KB;
public List<File> split(File inputFile, String splitSize)
{
int expectedNoOfFiles =0;
List<File> splitFileList = new ArrayList<File>();
try
{
double parentFileSizeInB = inputFile.length();
Pattern p = Pattern.compile("(\\d+)\\s([MmGgKk][Bb])");
Matcher m = p.matcher(splitSize);
m.matches();
String FileSizeString = m.group(1);
System.out.println("FileSizeString----------------------"+FileSizeString);
String unit = m.group(2);
double fileSizeInMB = 0;
try {
if (unit.toLowerCase().equals("kb"))
fileSizeInMB = Double.parseDouble(FileSizeString) / KB;
else if (unit.toLowerCase().equals("mb"))
fileSizeInMB = Double.parseDouble(FileSizeString);
else if (unit.toLowerCase().equals("gb"))
fileSizeInMB = Double.parseDouble(FileSizeString) * KB;
}
catch (NumberFormatException e) {
logger.error("invalid number [" + fileSizeInMB + "] for expected file size");
}
System.out.println("fileSizeInMB----------------------"+fileSizeInMB);
double fileSize = fileSizeInMB * MB;
long fileSizeInByte = (long) Math.ceil(fileSize);
double noOFFiles = parentFileSizeInB/fileSizeInByte;
expectedNoOfFiles = (int) Math.ceil(noOFFiles);
System.out.println("0000000000000000000000000"+expectedNoOfFiles);
GZIPInputStream in = new GZIPInputStream(new FileInputStream(inputFile));
DataInputStream datain = new DataInputStream(in);
BufferedReader fis = new BufferedReader(new InputStreamReader(datain));
int count= 0 ;
int splinterCount = 1;
GZIPOutputStream outputFileWriter = null;
while ((count = fis.read()) != -1)
{
System.out.println("count----------------------1 "+count);
int outputFileLength = 0;
outputFileWriter = new GZIPOutputStream(new FileOutputStream("F:\\ff\\" + "_part_" + splinterCount + "_of_" + expectedNoOfFiles + ".gz"));
while ( (count = fis.read()) != -1
&& outputFileLength < fileSizeInByte
) {
outputFileWriter.write(count);
outputFileLength ++;
count = fis.read();
}
System.out.println("count----------------------2 "+count);
//outputFileWriter.finish();
outputFileWriter.close();
splinterCount ++;
}
fis.close();
datain.close();
in.close();
outputFileWriter.close();
System.out.println("Finished");
}catch(Exception e)
{
logger.error("Unable to split the file " + inputFile.getName() + " in to " + expectedNoOfFiles);
return null;
}
logger.debug("Successfully split the file [" + inputFile.getName() + "] in to " + expectedNoOfFiles + " files");
return splitFileList;
}
public static void main(String args[])
{
String filePath1 = "F:\\filename.gz";
File file = new File(filePath1);
FileSplitter fileSplitter = new FileSplitter();
String splitlen = "1 MB";
int noOfFilesSplit = 3;
fileSplitter.split(file, splitlen);
}
}
Andreas' answer covers your main question, but there are a lot of problems in that code. Most importantly, you're throwing out one byte for each 'split' (the outer while calls fis.read() and ignores the value).
Why are you wrapping your gzip input stream in a DataInputStream and a BufferedReader if you're still reading it a byte at a time?
Edit
Ah, and you're also throwing out the last byte of each split, too (except for the very last one).
Hard to tell, but it looks to me like you're counting the uncompressed bytes. The compressed chunks (the resulting files) will be smaller.
When you compress data with gzip the output file size depends on the complexity of data. Here you are compressing equally sized blocks, but their compressed sizes are different. No lossless compression algorithm reduces the size of input by a constant factor.
If you want splinters of equal size you should split the compressed data instead of decompressing first. But that of course means that the splinters have to be decompressed in order and you can't decompress one without reading the ones that precede it.
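As a rough illustration of that alternative, the compressed file can be cut into fixed-size byte chunks without decompressing anything; a minimal sketch (the chunk files are only useful once concatenated back in their original order and then gunzipped):
import java.io.*;

public class RawChunkSplitter
{
    // Splits any file into parts of roughly chunkSize bytes (to within one buffer).
    // The parts are NOT valid .gz files on their own; they must be
    // concatenated in their original order before decompressing.
    public static void split(File input, long chunkSize) throws IOException
    {
        byte[] buffer = new byte[8192];
        int part = 1;
        try (InputStream in = new BufferedInputStream(new FileInputStream(input)))
        {
            int n = in.read(buffer);
            while (n != -1)
            {
                long written = 0;
                File partFile = new File(input.getName() + ".part" + part);
                try (OutputStream out = new BufferedOutputStream(new FileOutputStream(partFile)))
                {
                    while (n != -1 && written < chunkSize)
                    {
                        out.write(buffer, 0, n);
                        written += n;
                        n = in.read(buffer);
                    }
                }
                part++;
            }
        }
    }
}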

How to speed up external merge sort in Java

I am writing code for the external merge sort. The idea is that the input files contain too many numbers to be stored in an array so you read some of it and put it into files to be stored. Here's my code. While it runs fast, it is not fast enough. I was wondering if you can think of any improvements I can make on the code. Note that at first, I sort every 1m integers together so I skip iterations of the merging algorithm.
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
public class ExternalSort {
public static void sort(String f1, String f2) throws Exception {
RandomAccessFile raf1 = new RandomAccessFile(f1, "rw");
RandomAccessFile raf2 = new RandomAccessFile(f2, "rw");
int fileByteSize = (int) (raf1.length() / 4);
int size = Math.min(1000000, fileByteSize);
externalSort(f1, f2, size);
boolean writeToOriginal = true;
DataOutputStream dos;
while (size <= fileByteSize) {
if (writeToOriginal) {
raf1.seek(0);
dos = new DataOutputStream(new BufferedOutputStream(
new MyFileOutputStream(raf1.getFD())));
} else {
raf2.seek(0);
dos = new DataOutputStream(new BufferedOutputStream(
new MyFileOutputStream(raf2.getFD())));
}
for (int i = 0; i < fileByteSize; i += 2 * size) {
if (writeToOriginal) {
dos = merge(f2, dos, i, size);
} else {
dos = merge(f1, dos, i, size);
}
}
dos.flush();
writeToOriginal = !writeToOriginal;
size *= 2;
}
if (writeToOriginal)
{
raf1.seek(0);
raf2.seek(0);
dos = new DataOutputStream(new BufferedOutputStream(
new MyFileOutputStream(raf1.getFD())));
int i = 0;
while (i < raf2.length() / 4){
dos.writeInt(raf2.readInt());
i++;
}
dos.flush();
}
}
public static void externalSort(String f1, String f2, int size) throws Exception{
RandomAccessFile raf1 = new RandomAccessFile(f1, "rw");
RandomAccessFile raf2 = new RandomAccessFile(f2, "rw");
int fileByteSize = (int) (raf1.length() / 4);
int[] array = new int[size];
DataInputStream dis = new DataInputStream(new BufferedInputStream(
new MyFileInputStream(raf1.getFD())));
DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(
new MyFileOutputStream(raf2.getFD())));
int count = 0;
while (count < fileByteSize){
for (int k = 0; k < size; ++k){
array[k] = dis.readInt();
}
count += size;
Arrays.sort(array);
for (int k = 0; k < size; ++k){
dos.writeInt(array[k]);
}
}
dos.flush();
raf1.close();
raf2.close();
dis.close();
dos.close();
}
public static DataOutputStream merge(String file,
DataOutputStream dos, int start, int size) throws IOException {
RandomAccessFile raf = new RandomAccessFile(file, "rw");
RandomAccessFile raf2 = new RandomAccessFile(file, "rw");
int fileByteSize = (int) (raf.length() / 4);
raf.seek(4 * start);
raf2.seek(4 *start);
DataInputStream dis = new DataInputStream(new BufferedInputStream(
new MyFileInputStream(raf.getFD())));
DataInputStream dis3 = new DataInputStream(new BufferedInputStream(
new MyFileInputStream(raf2.getFD())));
int i = 0;
int j = 0;
int max = size * 2;
int a = dis.readInt();
int b;
if (start + size < fileByteSize) {
dis3.skip(4 * size);
b = dis3.readInt();
} else {
b = Integer.MAX_VALUE;
j = size;
}
while (i + j < max) {
if (j == size || (a <= b && i != size)) {
dos.writeInt(a);
i++;
if (start + i == fileByteSize) {
i = size;
} else if (i != size) {
a = dis.readInt();
}
} else {
dos.writeInt(b);
j++;
if (start + size + j == fileByteSize) {
j = size;
} else if (j != size) {
b = dis3.readInt();
}
}
}
raf.close();
raf2.close();
return dos;
}
public static void main(String[] args) throws Exception {
String f1 = args[0];
String f2 = args[1];
sort(f1, f2);
}
}
You might wish to merge k>2 segments at a time. This reduces the amount of I/O from roughly n log n / log 2 to n log n / log k, because fewer merge passes over the data are needed.
Edit: In pseudocode, this would look something like this:
void sort(List list) {
if (list fits in memory) {
list.sort();
} else {
sublists = partition list into k about equally big sublists
for (sublist : sublists) {
sort(sublist);
}
merge(sublists);
}
}
void merge(List[] sortedsublists) {
keep a pointer in each sublist, which initially points to its first element
do {
find the pointer pointing at the smallest element
add the element it points to to the result list
advance that pointer
} until all pointers have reached the end of their sublist
return the result list
}
To efficiently find the "smallest" pointer, you might employ a PriorityQueue.
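For example, a hedged sketch of that k-way merge step in Java, assuming the sorted runs are already available as Iterator<Integer> instances (how the runs are read from disk and buffered is left out):
import java.util.*;

static List<Integer> kWayMerge(List<Iterator<Integer>> runs) {
    // One heap entry per run: {current value, index of the run it came from}
    PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
    for (int r = 0; r < runs.size(); r++) {
        if (runs.get(r).hasNext()) {
            heap.add(new int[] { runs.get(r).next(), r });
        }
    }
    List<Integer> result = new ArrayList<>();
    while (!heap.isEmpty()) {
        int[] smallest = heap.poll();       // run with the smallest current value
        result.add(smallest[0]);
        Iterator<Integer> run = runs.get(smallest[1]);
        if (run.hasNext()) {
            heap.add(new int[] { run.next(), smallest[1] });  // advance that run's pointer
        }
    }
    return result;
}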
I would use memory-mapped files. It can be as much as 10x faster than using this type of IO. I suspect it will be much faster in this case as well. The mapped buffers use virtual memory rather than heap space to store data and can be larger than your available physical memory.
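A small sketch of what that could look like for the integer case from the question (assuming the file simply contains big-endian ints, as written by DataOutputStream, and fits below the 2 GB mapping limit):
import java.io.RandomAccessFile;
import java.nio.IntBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

static int[] readIntsMapped(String fileName) throws Exception {
    try (RandomAccessFile raf = new RandomAccessFile(fileName, "r");
         FileChannel channel = raf.getChannel()) {
        MappedByteBuffer mapped =
            channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        IntBuffer ints = mapped.asIntBuffer(); // big-endian by default, like DataOutputStream
        int[] result = new int[ints.remaining()];
        ints.get(result);
        return result;
    }
}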
We have implemented a public domain external sort in Java:
http://code.google.com/p/externalsortinginjava/
It might be faster than yours. We use strings and not integers, but you could easily modify our code by substituting integers for strings (the code was made hackable by design). At the very least, you can compare with our design.
Looking at your code, it seems like you are reading the data in units of integers. So IO will be a bottleneck I would guess. With external memory algorithms, you want to read and write blocks of data---especially in Java.
You are sorting integers so you should check out radix sort. The core idea of radix sort is that you can sort n byte integers with n passes through the data with radix 256.
You can combine this with merge sort theory.
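For reference, a minimal in-memory LSD radix sort over int[] with radix 256 could look like the sketch below; the sign bit is flipped on the most significant pass so negative values end up before positive ones. An external variant would apply the same passes to blocks read from disk.
static void radixSort(int[] a) {
    int n = a.length;
    int[] tmp = new int[n];
    for (int shift = 0; shift < 32; shift += 8) {
        int[] count = new int[257];
        for (int v : a) {
            int key = (shift == 24 ? v ^ 0x80000000 : v); // flip sign bit on the top byte
            count[((key >>> shift) & 0xFF) + 1]++;
        }
        for (int i = 0; i < 256; i++) {
            count[i + 1] += count[i];                      // prefix sums: start index per bucket
        }
        for (int v : a) {
            int key = (shift == 24 ? v ^ 0x80000000 : v);
            tmp[count[(key >>> shift) & 0xFF]++] = v;      // stable placement
        }
        System.arraycopy(tmp, 0, a, 0, n);
    }
}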

How can I get the count of lines in a file in an efficient way? [duplicate]

This question already has answers here:
Number of lines in a file in Java
(19 answers)
Closed 6 years ago.
I have a big file. It includes approximately 3,000-20,000 lines. How can I get the total count of lines in the file using Java?
BufferedReader reader = new BufferedReader(new FileReader("file.txt"));
int lines = 0;
while (reader.readLine() != null) lines++;
reader.close();
Update: To answer the performance question raised here, I made a measurement. First thing: 20,000 lines are too few to get the program running for a noticeable time. I created a text file with 5 million lines. This solution (started with java without parameters like -server or -XX options) needed around 11 seconds on my box. The same with wc -l (the UNIX command-line tool to count lines): 11 seconds. The solution reading every single character and looking for '\n' needed 104 seconds, 9-10 times as much.
Files.lines
Java 8+ has a nice and short way using NIO using Files.lines. Note that you have to close the stream using try-with-resources:
long lineCount;
try (Stream<String> stream = Files.lines(path, StandardCharsets.UTF_8)) {
lineCount = stream.count();
}
If you don't specify the character encoding, the default one used is UTF-8. You may specify an alternate encoding to match your particular data file as shown in the example above.
Use LineNumberReader, something like this:
public static int countLines(File aFile) throws IOException {
LineNumberReader reader = null;
try {
reader = new LineNumberReader(new FileReader(aFile));
while ((reader.readLine()) != null);
return reader.getLineNumber();
} catch (Exception ex) {
return -1;
} finally {
if(reader != null)
reader.close();
}
}
I found a solution for this that might be useful for you.
Below is a code snippet that counts the number of lines in a file.
File file = new File("/mnt/sdcard/abc.txt");
LineNumberReader lineNumberReader = new LineNumberReader(new FileReader(file));
lineNumberReader.skip(Long.MAX_VALUE);
int lines = lineNumberReader.getLineNumber();
lineNumberReader.close();
Read the file through and count the number of newline characters. An easy way to read a file in Java, one line at a time, is the java.util.Scanner class.
This is about as efficient as it can get: a buffered binary read, no string conversion:
FileInputStream stream = new FileInputStream("/tmp/test.txt");
byte[] buffer = new byte[8192];
int count = 0;
int n;
while ((n = stream.read(buffer)) > 0) {
for (int i = 0; i < n; i++) {
if (buffer[i] == '\n') count++;
}
}
stream.close();
System.out.println("Number of lines: " + count);
Do you need the exact number of lines or only an approximation? I happen to process large files in parallel, and often I don't need to know the exact count of lines, so I revert to sampling. Split the file into ten 1 MB chunks and count the lines in each chunk, then multiply by 10 and you'll get a pretty good approximation of the line count. A sketch of this idea follows below.
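A rough sketch of that sampling idea, simplified to a single sample taken at the start of the file and extrapolated by the total file size (sampleBytes is a made-up parameter):
static long estimateLineCount(File file, int sampleBytes) throws IOException {
    long fileSize = file.length();
    if (fileSize == 0) {
        return 0;
    }
    byte[] sample = new byte[(int) Math.min(sampleBytes, fileSize)];
    try (FileInputStream in = new FileInputStream(file)) {
        int read = in.read(sample);
        long newlines = 0;
        for (int i = 0; i < read; i++) {
            if (sample[i] == '\n') newlines++;
        }
        // Extrapolate, assuming the sampled region is representative of the whole file
        return Math.round(newlines * ((double) fileSize / read));
    }
}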
All previous answers suggest reading through the whole file and counting the number of newlines you find while doing so. You commented that some are "not efficient", but that's the only way to do it. A "line" is nothing more than a simple character inside the file, and to count that character you must look at every single character within the file.
I'm sorry, but you have no choice. :-)
This solution is about 3.6× faster than the top rated answer when tested on a file with 13.8 million lines. It simply reads the bytes into a buffer and counts the \n characters. You could play with the buffer size, but on my machine, anything above 8KB didn't make the code faster.
private int countLines(File file) throws IOException {
int lines = 0;
FileInputStream fis = new FileInputStream(file);
byte[] buffer = new byte[BUFFER_SIZE]; // BUFFER_SIZE = 8 * 1024
int read;
while ((read = fis.read(buffer)) != -1) {
for (int i = 0; i < read; i++) {
if (buffer[i] == '\n') lines++;
}
}
fis.close();
return lines;
}
If the already posted answers aren't fast enough you'll probably have to look for a solution specific to your particular problem.
For example if these text files are logs that are only appended to and you regularly need to know the number of lines in them you could create an index. This index would contain the number of lines in the file, when the file was last modified and how large the file was then. This would allow you to recalculate the number of lines in the file by skipping over all the lines you had already seen and just reading the new lines.
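A sketch of that index idea, assuming the file is only ever appended to; this version keeps the index in memory (a persisted index would also store the file size and last-modified time, as described above):
class LineCountIndex {
    private long countedLines = 0;
    private long countedBytes = 0;

    // Returns the current line count, reading only the bytes appended since the last call.
    long lineCount(File file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek(countedBytes);
            byte[] buffer = new byte[8192];
            int n;
            while ((n = raf.read(buffer)) != -1) {
                for (int i = 0; i < n; i++) {
                    if (buffer[i] == '\n') countedLines++;
                }
                countedBytes += n;
            }
        }
        return countedLines;
    }
}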
Old post, but I have a solution that could be useful for others.
Why not just use the file length to know the progress? Of course, lines have to be roughly the same size, but it works very well for big files:
public static void main(String[] args) throws IOException {
File file = new File("yourfilehere");
double fileSize = file.length();
System.out.println("=======> File size = " + fileSize);
InputStream inputStream = new FileInputStream(file);
InputStreamReader inputStreamReader = new InputStreamReader(inputStream, "iso-8859-1");
BufferedReader bufferedReader = new BufferedReader(inputStreamReader);
int totalRead = 0;
try {
while (bufferedReader.ready()) {
String line = bufferedReader.readLine();
// LINE PROCESSING HERE
totalRead += line.length() + 1; // we add +1 byte for the newline char.
System.out.println("Progress ===> " + ((totalRead / fileSize) * 100) + " %");
}
} finally {
bufferedReader.close();
}
}
It lets you see the progress without doing a full read of the file first. I know it depends on a lot of factors, but I hope it will be useful :).
[Edit]
Here is a version with an estimated time. I put in some printouts to show the progress and the estimate. The time estimate becomes good once enough lines have been processed (I tried with 10M lines, and after 1% of the processing the time estimate was about 95% accurate).
I know some values should be moved into variables. This code was written quickly, but it has been useful for me. Hope it will be for you too :).
long startProcessLine = System.currentTimeMillis();
int totalRead = 0;
long progressTime = 0;
double percent = 0;
int i = 0;
int j = 0;
int fullEstimation = 0;
try {
while (bufferedReader.ready()) {
String line = bufferedReader.readLine();
totalRead += line.length() + 1;
progressTime = System.currentTimeMillis() - startProcessLine;
percent = (double) totalRead / fileSize * 100;
if ((percent > 1) && i % 10000 == 0) {
int estimation = (int) ((progressTime / percent) * (100 - percent));
fullEstimation += progressTime + estimation;
j++;
System.out.print("Progress ===> " + percent + " %");
System.out.print(" - current progress : " + (progressTime) + " milliseconds");
System.out.print(" - Will be finished in ===> " + estimation + " milliseconds");
System.out.println(" - estimated full time => " + (progressTime + estimation));
}
i++;
}
} finally {
bufferedReader.close();
}
System.out.println("Ended in " + (progressTime) + " seconds");
System.out.println("Estimative average ===> " + (fullEstimation / j));
System.out.println("Difference: " + ((((double) 100 / (double) progressTime)) * (progressTime - (fullEstimation / j))) + "%");
Feel free to improve this code if you think it's a good solution.
Quick and dirty, but it does the job:
import java.io.*;
public class Counter {
public final static void main(String[] args) throws IOException {
if (args.length > 0) {
File file = new File(args[0]);
System.out.println(countLines(file));
}
}
public final static int countLines(File file) throws IOException {
ProcessBuilder builder = new ProcessBuilder("wc", "-l", file.getAbsolutePath());
Process process = builder.start();
InputStream in = process.getInputStream();
LineNumberReader reader = new LineNumberReader(new InputStreamReader(in));
String line = reader.readLine();
if (line != null) {
return Integer.parseInt(line.trim().split(" ")[0]);
} else {
return -1;
}
}
}
Read the file line by line and increment a counter for each line until you have read the entire file.
Try the unix "wc" command. I don't mean use it, I mean download the source and see how they do it. It's probably in c, but you can easily port the behavior to java. The problem with making your own is to account for the ending cr/lf problem.
The buffered reader is overkill
Reader r = new FileReader("f.txt");
int count = 0;
int nextchar = 0;
while (nextchar != -1) {
    nextchar = r.read();
    if (nextchar == '\n') {
        count++;
    }
}
My search for a simple example has created one that's actually quite poor: calling read() repeatedly for a single character is less than optimal. See here for examples and measurements.
