Is there any reason to prefer a CharBuffer to a char[] in the following:
CharBuffer buf = CharBuffer.allocate(DEFAULT_BUFFER_SIZE);
while( in.read(buf) >= 0 ) {
out.append( buf.flip() );
buf.clear();
}
vs.
char[] buf = new char[DEFAULT_BUFFER_SIZE];
int n;
while( (n = in.read(buf)) >= 0 ) {
out.write( buf, 0, n );
}
(where in is a Reader and out in a Writer)?
No, there's really no reason to prefer a CharBuffer in this case.
In general, though, CharBuffer (and ByteBuffer) can really simplify APIs and encourage correct processing. If you were designing a public API, it's definitely worth considering a buffer-oriented API.
I wanted to mini-benchmark this comparison.
Below is the class I have written.
The thing is I can't believe that the CharBuffer performed so badly. What have I got wrong?
EDIT: Since the 11th comment below I have edited the code and the output time, better performance all round but still a significant difference in times. I also tried out2.append((CharBuffer)buff.flip()) option mentioned in the comments but it was much slower than the write option used in the code below.
Results: (time in ms)
char[] : 3411
CharBuffer: 5653
public class CharBufferScratchBox
{
public static void main(String[] args) throws Exception
{
// Some Setup Stuff
String smallString =
"1111111111222222222233333333334444444444555555555566666666667777777777888888888899999999990000000000";
StringBuilder stringBuilder = new StringBuilder();
for (int i = 0; i < 1000; i++)
{
stringBuilder.append(smallString);
}
String string = stringBuilder.toString();
int DEFAULT_BUFFER_SIZE = 1000;
int ITTERATIONS = 10000;
// char[]
StringReader in1 = null;
StringWriter out1 = null;
Date start = new Date();
for (int i = 0; i < ITTERATIONS; i++)
{
in1 = new StringReader(string);
out1 = new StringWriter(string.length());
char[] buf = new char[DEFAULT_BUFFER_SIZE];
int n;
while ((n = in1.read(buf)) >= 0)
{
out1.write(
buf,
0,
n);
}
}
Date done = new Date();
System.out.println("char[] : " + (done.getTime() - start.getTime()));
// CharBuffer
StringReader in2 = null;
StringWriter out2 = null;
start = new Date();
CharBuffer buff = CharBuffer.allocate(DEFAULT_BUFFER_SIZE);
for (int i = 0; i < ITTERATIONS; i++)
{
in2 = new StringReader(string);
out2 = new StringWriter(string.length());
int n;
while ((n = in2.read(buff)) >= 0)
{
out2.write(
buff.array(),
0,
n);
buff.clear();
}
}
done = new Date();
System.out.println("CharBuffer: " + (done.getTime() - start.getTime()));
}
}
If this is the only thing you're doing with the buffer, then the array is probably the better choice in this instance.
CharBuffer has lots of extra chrome on it, but none of it is relevant in this case - and will only slow things down a fraction.
You can always refactor later if you need to make things more complicated.
The difference, in practice, is actually <10%, not 30% as others are reporting.
To read and write a 5MB file 24 times, my numbers taken using a Profiler. They were on average:
char[] = 4139 ms
CharBuffer = 4466 ms
ByteBuffer = 938 (direct) ms
Individual tests a couple times favored CharBuffer.
I also tried replacing the File-based IO with In-Memory IO and the performance was similar. If you are trying to transfer from one native stream to another, then you are better off using a "direct" ByteBuffer.
With less than 10% performance difference, in practice, I would favor the CharBuffer. It's syntax is clearer, there's less extraneous variables, and you can do more direct manipulation on it (i.e. anything that asks for a CharSequence).
Benchmark is below... it is slightly wrong as the BufferedReader is allocated inside the test-method rather than outside... however, the example below allows you to isolate the IO time and eliminate factors like a string or byte stream resizing its internal memory buffer, etc.
public static void main(String[] args) throws Exception {
File f = getBytes(5000000);
System.out.println(f.getAbsolutePath());
try {
System.gc();
List<Main> impls = new java.util.ArrayList<Main>();
impls.add(new CharArrayImpl());
//impls.add(new CharArrayNoBuffImpl());
impls.add(new CharBufferImpl());
//impls.add(new CharBufferNoBuffImpl());
impls.add(new ByteBufferDirectImpl());
//impls.add(new CharBufferDirectImpl());
for (int i = 0; i < 25; i++) {
for (Main impl : impls) {
test(f, impl);
}
System.out.println("-----");
if(i==0)
continue; //reset profiler
}
System.gc();
System.out.println("Finished");
return;
} finally {
f.delete();
}
}
static int BUFFER_SIZE = 1000;
static File getBytes(int size) throws IOException {
File f = File.createTempFile("input", ".txt");
FileWriter writer = new FileWriter(f);
Random r = new Random();
for (int i = 0; i < size; i++) {
writer.write(Integer.toString(5));
}
writer.close();
return f;
}
static void test(File f, Main impl) throws IOException {
InputStream in = new FileInputStream(f);
File fout = File.createTempFile("output", ".txt");
try {
OutputStream out = new FileOutputStream(fout, false);
try {
long start = System.currentTimeMillis();
impl.runTest(in, out);
long end = System.currentTimeMillis();
System.out.println(impl.getClass().getName() + " = " + (end - start) + "ms");
} finally {
out.close();
}
} finally {
fout.delete();
in.close();
}
}
public abstract void runTest(InputStream ins, OutputStream outs) throws IOException;
public static class CharArrayImpl extends Main {
char[] buff = new char[BUFFER_SIZE];
public void runTest(InputStream ins, OutputStream outs) throws IOException {
Reader in = new BufferedReader(new InputStreamReader(ins));
Writer out = new BufferedWriter(new OutputStreamWriter(outs));
int n;
while ((n = in.read(buff)) >= 0) {
out.write(buff, 0, n);
}
}
}
public static class CharBufferImpl extends Main {
CharBuffer buff = CharBuffer.allocate(BUFFER_SIZE);
public void runTest(InputStream ins, OutputStream outs) throws IOException {
Reader in = new BufferedReader(new InputStreamReader(ins));
Writer out = new BufferedWriter(new OutputStreamWriter(outs));
int n;
while ((n = in.read(buff)) >= 0) {
buff.flip();
out.append(buff);
buff.clear();
}
}
}
public static class ByteBufferDirectImpl extends Main {
ByteBuffer buff = ByteBuffer.allocateDirect(BUFFER_SIZE * 2);
public void runTest(InputStream ins, OutputStream outs) throws IOException {
ReadableByteChannel in = Channels.newChannel(ins);
WritableByteChannel out = Channels.newChannel(outs);
int n;
while ((n = in.read(buff)) >= 0) {
buff.flip();
out.write(buff);
buff.clear();
}
}
}
I think that CharBuffer and ByteBuffer (as well as any other xBuffer) were meant for reusability so you can buf.clear() them instead of going through reallocation every time
If you don't reuse them, you're not using their full potential and it will add extra overhead. However if you're planning on scaling this function this might be a good idea to keep them there
You should avoid CharBuffer in recent Java versions, there is a bug in #subsequence(). You cannot get a subsequence from the second half of the buffer since the implementation confuses capacity and remaining. I observed the bug in java 6-0-11 and 6-0-12.
The CharBuffer version is slightly less complicated (one less variable), encapsulates buffer size handling and makes use of a standard API. Generally I would prefer this.
However there is still one good reason to prefer the array version, in some cases at least. CharBuffer was only introduced in Java 1.4 so if you are deploying to an earlier version you can't use Charbuffer (unless you roll-your-own/use a backport).
P.S If you use a backport remember to remove it once you catch up to the version containing the "real" version of the backported code.
Related
I need to convert a Reader object into InputStream. My solution right now is below. But my concern is since this will handle big chunks of data, it will increase the memory usage drastically.
private static InputStream getInputStream(final Reader reader) {
char[] buffer = new char[10240];
StringBuilder builder = new StringBuilder();
int charCount;
try {
while ((charCount = reader.read(buffer, 0, buffer.length)) != -1) {
builder.append(buffer, 0, charCount);
}
reader.close();
} catch (final IOException e) {
e.printStackTrace();
}
return new ByteArrayInputStream(builder.toString().getBytes(StandardCharsets.UTF_8));
}
Since I use StringBuilder this will keep the full content of the reader object in memory. I want to avoid this. Is there a way I can pipe Reader object? Any help regarding this highly appreciated.
Using the Apache Commons IO library, you can do this conversion in one line:
//import org.apache.commons.io.input.ReaderInputStream;
InputStream inputStream = new ReaderInputStream(reader, StandardCharsets.UTF_8);
You can read the documentaton for this Class at https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/input/ReaderInputStream.html
It might be worth trying this to see if it solves the memory issue too.
First: a rare requirement, often it is the other way around, or there is a FileChannel, so one can use a ByteBuffer.
A PipedInputStream would be possible, starting a PipedOutputStream in a second thread. However that is unneeded.
A Reader gives chars. Unicode code points are derived from either one or two chars (the latter a surrogate pair).
/**
* Reader for an InputSteam of UTF-8 text bytes.
*/
public class ReaderInputStream extends InputStream {
private final Reader reader;
private boolean eof;
private int byteCount;
private byte[] bytes = new byte[6];
public ReaderInputStream(Reader reader) {
this.reader = reader;
}
#Override
public int read() throws IOException {
if (byteCount > 0) {
int c = bytes[0];
--byteCount;
for (int i = 0; i < byteCount; ++i) {
bytes[i] = bytes[i + 1];
}
return c;
}
if (eof) {
return -1;
}
int c = reader.read();
if (c == -1) {
eof = true;
return -1;
}
char ch = (char) c;
String s;
if (Character.isHighSurrogate(ch)) {
c = reader.read();
if (c == -1) {
// Error, low surrogate expected.
eof = true;
//return -1;
throw new IOException("Expected a low surrogate char i.o. EOF");
}
char ch2 = (char) c;
if (!Character.isLowSurrogate(ch2)) {
throw new IOException("Expected a low surrogate char");
}
s = new String(new char [] {ch, ch2});
} else {
s = Character.toString(ch);
}
byte[] bs = s.getBytes(StandardCharsets.UTF_8);
byteCount = bs.length;
System.arraycopy(bs, 0, bytes, 0, byteCount);
return read();
}
}
Path source = Paths.get("...");
Path target = Paths.get("...");
try (Reader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8);
InputStream in = new ReaderInputStream(reader)) {
Files.copy(in, target);
}
We process huge files (sometimes 50 GB each file). The application reads this one file and based on the business logic, it will write multiple output files (4-6).
The records in the file are of variable length and each field in a record is a delimiter separated.
Going by the understanding that reading a file using FileChannel with a ByteBuffer was always better than using a BufferedReader.readLine and then using a split by the delimiter.
BufferSizes tried 10240(10KB) and even more
Commit interval - 5000, 10000 etc
Below is how we used file channel to read:
Read byte by byte. Check if the read byte is a new line char(10) -
which means end of line.
check for delimiter bytes. capture the bytes read in a byte array(we initialized this byte array with a maximum field size of 350 bytes) until delimiter bytes are encountered.
convert these bytes read until this time, to String using UTF-8 encoding - new String(byteArr, 0, index,"UTF-8") to be specific - index is the number of bytes read until delimiter.
Using this method of reading the file using FileChannel took 57 minutes to process the file.
We want to decrease this time and tried using BufferredReader.readLine() and then use a split by delimiter, to see how it fares.
And shockingly the same file completed processing only in 7 minutes.
What's the catch here? Why FileChannel is taking more time than a buffered reader and then using a string split.
I was always under the assumption that ReadLine and Split combination will have a big performance impact?
Can any one throw light on if I was using FileChannel in a wrong way? One
Thanks in advance. Hope I have summarized the issue properly.
The below is sample code :
while (inputByteBuffer.hasRemaining() && (b = inputByteBuffer.get()) != 0){
boolean endOfField = false;
if (b == 10){
break;
}
else{
if (b == 94){//^
if (!inputByteBuffer.hasRemaining()){
inputByteBuffer.clear();
noOfBytes = inputFileChannel.read(inputByteBuffer);
inputByteBuffer.flip();
}
if (inputByteBuffer.hasRemaining()){
byte b2 = inputByteBuffer.get();
if (b2 == 124){//|
if (!inputByteBuffer.hasRemaining()){
inputByteBuffer.clear();
noOfBytes = inputFileChannel.read(inputByteBuffer);
inputByteBuffer.flip();
}
if (inputByteBuffer.hasRemaining()){
byte b3 = inputByteBuffer.get();
if (b3 == 94){//^
String field = new String(fieldBytes, 0, index, encoding);
if(fieldIndex == -1){
fields = new String[sizeFromAConfiguration];
}else{
fields[fieldIndex] = field;
}
fieldBytes = new byte[maxFieldSize];
endOfField = true;
fieldIndex++;
}
else{
fieldBytes = addFieldBytes(fieldBytes, b, index);
index++;
fieldBytes = addFieldBytes(fieldBytes, b2, index);
index++;
fieldBytes = addFieldBytes(fieldBytes, b3, index);
}
}
else{
endOfFile = true;
//fields.add(new String(fieldBytes, 0, index, encoding));
fields[fieldIndex] = new String(fieldBytes, 0, index, encoding);
fieldBytes = new byte[maxFieldSize];
endOfField = true;
}
}else{
fieldBytes = addFieldBytes(fieldBytes, b, index);
index++;
fieldBytes = addFieldBytes(fieldBytes, b2, index);
}
}else{
endOfFile = true;
fieldBytes = addFieldBytes(fieldBytes, b, index);
}
}
else{
fieldBytes = addFieldBytes(fieldBytes, b, index);
}
}
if (!inputByteBuffer.hasRemaining()){
inputByteBuffer.clear();
noOfBytes = inputFileChannel.read(inputByteBuffer);
inputByteBuffer.flip();
}
if (endOfField){
index = 0;
}
else{
index++;
}
}
You're causing a lot of overhead with the constant hasRemaining()/read() checks as well as the constant get() calls. It would probably be better to get() the entire buffer into an array and process that directly, only calling read() when you get to the end.
And to answer a question in comments, you should not allocate a new ByteBuffer per read. This is expensive. Keep using the same one. And NB do not use a DirectByteBuffer for this application. It is not appropriate: it's only appropriate when you want the data to stay south of the JVM/JNI boundary, e.g. when merely copying between channels.
But I think I would throw this away, or rather rewrite it, using BufferedReader.read(), rather than readLine() followed by string splits, and using much the same logic as you have here, except of course that you don't need to keep calling hasRemaining() and filling the buffer, which BufferedReader will do automatically for you.
You have to take care to store the result of read() into an int, and to check it for -1 after every read().
It isn't clear to me that you should be using a Reader at all actually, unless you know you have multibyte text. Possibly a simple BufferedInputStream would be more appropriate.
While one cannot tell with certainty how a particular code will behave I would imagine the best way is to profile it just like you did.The FileChannel while percieved to be faster is actually not helping in your case.But this may not be because of reading from the file but actual processing that you do with the content you read.
One article I would like to point out while dealing with files is
https://www.redgreencode.com/why-is-java-io-slow/
Also the corresponding Github codebase
Java IO benchmark
I would like to point out this code to use a combination of both worlds
fos = new FileOutputStream(outputFile);
outFileChannel = fos.getChannel();
bufferedWriter = new BufferedWriter(Channels.newWriter(outFileChannel, "UTF-8"));
Since it is read in your case I will consider
File inputFile = new File("C:\\input.txt");
FileInputStream fis = new FileInputStream(inputFile);
FileChannel inputChannel = fis.getChannel();
BufferedReader bufferedReader = new BufferedReader(Channels.newReader(inputChannel,"UTF-8"));
Also I will tweak the chunksize and with Spring batch it is always trial and error to find sweet spot.
On a completely unrelated note the reason for your problem of not able to use BufferedReader is because of doubling of charecters and I am assuming this happens more commonly with ebcdic charecters.I will simply run a loop like this to identfy the troublemakers and eliminate at the source.
import java.io.UnsupportedEncodingException;
public class EbcdicConvertor {
public static void main(String[] args) throws UnsupportedEncodingException {
int index = 0;
for (int i = -127; i < 128; i++) {
byte[] b = new byte[1];
b[0] = (byte) i;
String cp037 = new String(b, "CP037");
if (cp037.getBytes().length == 2) {
index++;
System.out.println(i + "::" + cp037);
}
}
System.out.println(index);
}
}
The above answer is without testing my actual hypothesis.Here is an actual program to measure time.The results speak for themselves on a 200 MB file
import java.io.File;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;
public class ReadComplexDelimitedFile {
private static long total = 0;
private static final Pattern DELIMITER_PATTERN = Pattern.compile("\\^\\|\\^");
private void readFileUsingScanner() {
String s;
try (Scanner stdin = new Scanner(new File(this.getClass().getResource("input.txt").getPath()))) {
while (stdin.hasNextLine()) {
s = stdin.nextLine();
String[] fields = DELIMITER_PATTERN.split(s, 0);
total = total + fields.length;
}
} catch (Exception e) {
System.err.println("Error");
}
}
private void readFileUsingCustomBufferedReader() {
try (BufferedReader stdin = new BufferedReader(new FileReader(new File(this.getClass().getResource("input.txt").getPath())))) {
String s;
while ((s = stdin.readLine()) != null) {
String[] fields = DELIMITER_PATTERN.split(s, 0);
total += fields.length;
}
} catch (Exception e) {
System.err.println("Error");
}
}
private void readFileUsingBufferedReader() {
try (java.io.BufferedReader stdin = new java.io.BufferedReader(new FileReader(new File(this.getClass().getResource("input.txt").getPath())))) {
String s;
while ((s = stdin.readLine()) != null) {
String[] fields = DELIMITER_PATTERN.split(s, 0);
total += fields.length;
}
} catch (Exception e) {
System.err.println("Error");
}
}
private void readFileUsingBufferedReaderFileChannel() {
try (FileInputStream fis = new FileInputStream(this.getClass().getResource("input.txt").getPath())) {
try (FileChannel inputChannel = fis.getChannel()) {
try (BufferedReader stdin = new BufferedReader(Channels.newReader(inputChannel, "UTF-8"))) {
String s;
while ((s = stdin.readLine()) != null) {
String[] fields = DELIMITER_PATTERN.split(s, 0);
total = total + fields.length;
}
}
} catch (Exception e) {
System.err.println("Error");
}
} catch (Exception e) {
System.err.println("Error");
}
}
private void readFileUsingBufferedReaderByteFileChannel() {
try (FileInputStream fis = new FileInputStream(this.getClass().getResource("input.txt").getPath())) {
try (FileChannel inputChannel = fis.getChannel()) {
try (BufferedReader stdin = new BufferedReader(Channels.newReader(inputChannel, "UTF-8"))) {
int b;
StringBuilder sb = new StringBuilder();
while ((b = stdin.read()) != -1) {
if (b == 10) {
total = total + DELIMITER_PATTERN.split(sb, 0).length;
sb = new StringBuilder();
} else {
sb.append((char) b);
}
}
}
} catch (Exception e) {
e.printStackTrace();
}
} catch (Exception e) {
System.err.println("Error");
}
}
private void readFileUsingFileChannelStream() {
try (RandomAccessFile fis = new RandomAccessFile(new File(this.getClass().getResource("input.txt").getPath()), "r")) {
try (FileChannel inputChannel = fis.getChannel()) {
ByteBuffer byteBuffer = ByteBuffer.allocate(8192);
ByteBuffer recordBuffer = ByteBuffer.allocate(250);
int recordLength = 0;
while ((inputChannel.read(byteBuffer)) != -1) {
byte b;
byteBuffer.flip();
while (byteBuffer.hasRemaining() && (b = byteBuffer.get()) != -1) {
if (b == 10) {
recordBuffer.flip();
total = total + splitIntoFields(recordBuffer, recordLength);
recordBuffer.clear();
recordLength = 0;
} else {
++recordLength;
recordBuffer.put(b);
}
}
byteBuffer.clear();
}
}
} catch (Exception e) {
e.printStackTrace();
}
}
private int splitIntoFields(ByteBuffer recordBuffer, int recordLength) {
byte b;
String[] fields = new String[17];
int fieldCount = -1;
StringBuilder sb = new StringBuilder();
for (int i = 0; i < recordLength - 1; i++) {
b = recordBuffer.get(i);
if (b == 94 && recordBuffer.get(++i) == 124 && recordBuffer.get(++i) == 94) {
fields[++fieldCount] = sb.toString();
sb = new StringBuilder();
} else {
sb.append((char) b);
}
}
fields[++fieldCount] = sb.toString();
return fields.length;
}
public static void main(String args[]) {
//JVM wamrup
for (int i = 0; i < 100000; i++) {
total += i;
}
// We know scanner is slow-Still warming up
ReadComplexDelimitedFile readComplexDelimitedFile = new ReadComplexDelimitedFile();
List<Long> longList = new ArrayList<>(50);
for (int i = 0; i < 50; i++) {
total = 0;
long startTime = System.nanoTime();
readComplexDelimitedFile.readFileUsingScanner();
long stopTime = System.nanoTime();
long timeDifference = stopTime - startTime;
longList.add(timeDifference);
}
System.out.println("Time taken for readFileUsingScanner");
longList.forEach(System.out::println);
// Actual performance test starts here
longList = new ArrayList<>(10);
for (int i = 0; i < 10; i++) {
total = 0;
long startTime = System.nanoTime();
readComplexDelimitedFile.readFileUsingBufferedReaderFileChannel();
long stopTime = System.nanoTime();
long timeDifference = stopTime - startTime;
longList.add(timeDifference);
}
System.out.println("Time taken for readFileUsingBufferedReaderFileChannel");
longList.forEach(System.out::println);
longList.clear();
for (int i = 0; i < 10; i++) {
total = 0;
long startTime = System.nanoTime();
readComplexDelimitedFile.readFileUsingBufferedReader();
long stopTime = System.nanoTime();
long timeDifference = stopTime - startTime;
longList.add(timeDifference);
}
System.out.println("Time taken for readFileUsingBufferedReader");
longList.forEach(System.out::println);
longList.clear();
for (int i = 0; i < 10; i++) {
total = 0;
long startTime = System.nanoTime();
readComplexDelimitedFile.readFileUsingCustomBufferedReader();
long stopTime = System.nanoTime();
long timeDifference = stopTime - startTime;
longList.add(timeDifference);
}
System.out.println("Time taken for readFileUsingCustomBufferedReader");
longList.forEach(System.out::println);
longList.clear();
for (int i = 0; i < 10; i++) {
total = 0;
long startTime = System.nanoTime();
readComplexDelimitedFile.readFileUsingBufferedReaderByteFileChannel();
long stopTime = System.nanoTime();
long timeDifference = stopTime - startTime;
longList.add(timeDifference);
}
System.out.println("Time taken for readFileUsingBufferedReaderByteFileChannel");
longList.forEach(System.out::println);
longList.clear();
for (int i = 0; i < 10; i++) {
total = 0;
long startTime = System.nanoTime();
readComplexDelimitedFile.readFileUsingFileChannelStream();
long stopTime = System.nanoTime();
long timeDifference = stopTime - startTime;
longList.add(timeDifference);
}
System.out.println("Time taken for readFileUsingFileChannelStream");
longList.forEach(System.out::println);
}
}
BufferedReader was written very long back and hence we can rewrite some parts relevant to this example.For instance we don't care about \r and skipLF or skipCR or those kinds of stuff
We are going to read the file( no need for syncrhonized)
By extension no need for StringBuffer even otherwise StringBuilder can be used.Performance improvement immediately seen.
dangerous hack,remove synchronized and replace StringBuffer with StringBuilder don't use it without proper testing and not knowing what you are doing
public String readLine() throws IOException {
StringBuilder s = null;
int startChar;
bufferLoop:
for (; ; ) {
if (nextChar >= nChars)
fill();
if (nextChar >= nChars) { /* EOF */
if (s != null && s.length() > 0)
return s.toString();
else
return null;
}
boolean eol = false;
char c = 0;
int i;
/* Skip a leftover '\n', if necessary */
charLoop:
for (i = nextChar; i < nChars; i++) {
c = cb[i];
if (c == '\n') {
eol = true;
break charLoop;
}
}
startChar = nextChar;
nextChar = i;
if (eol) {
String str;
if (s == null) {
str = new String(cb, startChar, i - startChar);
} else {
s.append(cb, startChar, i - startChar);
str = s.toString();
}
nextChar++;
return str;
}
if (s == null)
s = new StringBuilder(defaultExpectedLineLength);
s.append(cb, startChar, i - startChar);
}
}
Java 8 Intel i5 12 GB RAM Windows 10
Result:
Time taken for readFileUsingBufferedReaderFileChannel::
2581635057 1849820885 1763992972 1770510738 1746444157 1733491399
1740530125 1723907177 1724280512 1732445638
Time taken for readFileUsingBufferedReader
1851027073 1775304769 1803507033 1789979554 1786974538 1802675458
1789672780 1798036307 1789847714 1785302003
Time taken for readFileUsingCustomBufferedReader
1745220476 1721039975 1715383650 1728548462 1724746005 1718177466
1738026017 1748077438 1724608192 1736294175
Time taken for readFileUsingBufferedReaderByteFileChannel
2872857919 2480237636 2917488143 2913491126 2880117231 2904614745
2911756298 2878777496 2892169722 2888091211
Time taken for readFileUsingFileChannelStream
3039447073 2896156498 2538389366 2906287280 2887612064 2929288046
2895626578 2955326255 2897535059 2884476915
Process finished with exit code 0
I did try NIO with all possible options(provided in this post and to the best of my knowledge and research) and found that it no where came close to BufferedReader in terms of reading a text file.
Changing BufferedReader to use StringBuilder in place of StringBuffer, I don't see any significant improvement in performance (only very few seconds for some files and some of them were better using StringBuffer itself).
Removing synchronized block also didn't give much/any improvement. And it's not worth to tweak something by which we didn't receive any benefit.
The below is the time taken(reading, processing, writing - time taken for processing and writing is not significant - not even 20% of time) for file which is around 50 GB
NIO : 71.67 (Minutes)
IO (BufferedReader) : 10.84 (Minutes)
Thank you all for your time to reading and responding to this post and providing suggestions.
The main issue here is creating a new byte[] very rapidly(fieldBytes = new byte[maxFieldSize];).
Since for every iteration a new array is being created, garbage collection is being kicked off very often which triggers "stop the world" to reclaim the memory.
And also, the object creation could be expensive.
We could rather initialize the byte array once and then track the indexes to just convert the field to string with an end index.
And anyway, BufferedReader is faster than FileChannel, atleast to read the ASCII files, and to keep the code simple, we continued using Bufferred Reader itself.
Using Bufferred reader, the development and testing effort can be reduced by not having tedious logic to find delimiters and populating the object.
What is the easiest way to check (in a unit test) whether binary files A and B are equal?
Are third-party libraries fair game? Guava has Files.equal(File, File). There's no real reason to bother with hashing if you don't have to; it can only be less efficient.
There's always just reading byte by byte from each file and comparing them as you go. Md5 and Sha1 etc still have to read all the bytes so computing the hash is extra work that you don't have to do.
if (file1.length() != file2.length()) {
return false;
}
try( InputStream in1 = new BufferedInputStream(new FileInputStream(file1));
InputStream in2 = new BufferedInputStream(new FileInputStream(file2));
) {
int value1, value2;
do {
//since we're buffered, read() isn't expensive
value1 = in1.read();
value2 = in2.read();
if(value1 != value2) {
return false;
}
} while(value1 >= 0);
// since we already checked that the file sizes are equal
// if we're here we reached the end of both files without a mismatch
return true;
}
With assertBinaryEquals.
public static void assertBinaryEquals(java.io.File expected,
java.io.File actual)
http://junit-addons.sourceforge.net/junitx/framework/FileAssert.html
Read the files in (small) blocks and compare them:
static boolean binaryDiff(File a, File b) throws IOException {
if(a.length() != b.length()){
return false;
}
final int BLOCK_SIZE = 128;
InputStream aStream = new FileInputStream(a);
InputStream bStream = new FileInputStream(b);
byte[] aBuffer = new byte[BLOCK_SIZE];
byte[] bBuffer = new byte[BLOCK_SIZE];
do {
int aByteCount = aStream.read(aBuffer, 0, BLOCK_SIZE);
bStream.read(bBuffer, 0, BLOCK_SIZE);
if (!Arrays.equals(aBuffer, bBuffer)) {
return false;
}
}
while(aByteCount < 0);
return true;
}
If you want to avoid dependencies you can do it using quite nicely with Files.readAllBytes and Assert.assertArrayEquals
Assert.assertArrayEquals("Binary files differ",
Files.readAllBytes(Paths.get(expectedBinaryFile)),
Files.readAllBytes(Paths.get(actualBinaryFile)));
Note: This will read the whole file so it might not be efficient with large files.
Since Java 12 you could also use the Files.mismatch method JavaDoc. It will return -1L if the files are the same.
I had to do the same in a unit test too, so I used SHA1 hashes to do that, to spare the the calculation of the hashes I check if the files sizes are equal first. Here was my attempt:
public class SHA1Compare {
private static final int CHUNK_SIZE = 4096;
public void assertEqualsSHA1(String expectedPath, String actualPath) throws IOException, NoSuchAlgorithmException {
File expectedFile = new File(expectedPath);
File actualFile = new File(actualPath);
Assert.assertEquals(expectedFile.length(), actualFile.length());
try (FileInputStream fisExpected = new FileInputStream(actualFile);
FileInputStream fisActual = new FileInputStream(expectedFile)) {
Assert.assertEquals(makeMessageDigest(fisExpected),
makeMessageDigest(fisActual));
}
}
public String makeMessageDigest(InputStream is) throws NoSuchAlgorithmException, IOException {
byte[] data = new byte[CHUNK_SIZE];
MessageDigest md = MessageDigest.getInstance("SHA1");
int bytesRead = 0;
while(-1 != (bytesRead = is.read(data, 0, CHUNK_SIZE))) {
md.update(data, 0, bytesRead);
}
return toHexString(md.digest());
}
private String toHexString(byte[] digest) {
StringBuilder sha1HexString = new StringBuilder();
for(int i = 0; i < digest.length; i++) {
sha1HexString.append(String.format("%1$02x", Byte.valueOf(digest[i])));
}
return sha1HexString.toString();
}
}
I have the below class.
class MyObject implements Serializable {
private String key;
private String val;
private int num;
MyObject(String a, String b, int c) {
this.key = a;
this.val = b;
this.num = c;
}
}
I need to create a list of Objects, the following method is called repeatedly (say 10K times or more)
public void addToIndex(String a, String b, int c) {
MyObject ob = new MyObject(a,b,c);
list.add(ob); // List<MyObject>
}
I used a profiler to see the memory footprint, and it increases so much due to creation of object everytime. Is there a better way of doing this? I am writing the list then to disk.
EDIT: This is how I write once the list is fully populated. Is there a way to append once the memory goes beyond a value (size of list).
ObjectOutputStream oos = new ObjectOutputStream(
new DeflaterOutputStream(new FileOutputStream(
list)));
oos.writeObject(list);
oos.close();
I used a profiler to see the memory footprint, and it increases so much due to creation of object everytime. Is there a better way of doing this?
Java Serialization doesn't use that much memory in your situation. What it does so is create a lot of garbage, far more than you might imagine. It also has a very verbose output which can be improved using compression as you do.
A simple way to improve this situation is to use Externalizable instead of Serializable. This can reduce the garbage produced dramatically and make it more compact. It can also be much faster with lower over head.
BTW You can get even better performance if you use custom serialization for the list itself.
public class Main {
public static void main(String[] args) throws IOException, ClassNotFoundException {
List<MyObject> list = new ArrayList<>();
for (int i = 0; i < 10000; i++) {
list.add(new MyObject("key-" + i, "value-" + i, i));
}
for (int i = 0; i < 10; i++) {
timeJavaSerialization(list);
timeCustomSerialization(list);
timeCustomSerialization2(list);
}
}
private static void timeJavaSerialization(List<MyObject> list) throws IOException, ClassNotFoundException {
File file = File.createTempFile("java-serialization", "dz");
long start = System.nanoTime();
ObjectOutputStream oos = new ObjectOutputStream(
new DeflaterOutputStream(new FileOutputStream(file)));
oos.writeObject(list);
oos.close();
ObjectInputStream ois = new ObjectInputStream(
new InflaterInputStream(new FileInputStream(file)));
Object o = ois.readObject();
ois.close();
long time = System.nanoTime() - start;
long size = file.length();
System.out.printf("Java serialization uses %,d bytes and too %.3f seconds.%n",
size, time / 1e9);
}
private static void timeCustomSerialization(List<MyObject> list) throws IOException {
File file = File.createTempFile("custom-serialization", "dz");
long start = System.nanoTime();
MyObject.writeList(file, list);
Object o = MyObject.readList(file);
long time = System.nanoTime() - start;
long size = file.length();
System.out.printf("Faster Custom serialization uses %,d bytes and too %.3f seconds.%n",
size, time / 1e9);
}
private static void timeCustomSerialization2(List<MyObject> list) throws IOException {
File file = File.createTempFile("custom2-serialization", "dz");
long start = System.nanoTime();
{
DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(
new DeflaterOutputStream(new FileOutputStream(file))));
dos.writeInt(list.size());
for (MyObject mo : list) {
dos.writeUTF(mo.key);
}
for (MyObject mo : list) {
dos.writeUTF(mo.val);
}
for (MyObject mo : list) {
dos.writeInt(mo.num);
}
dos.close();
}
{
DataInputStream dis = new DataInputStream(new BufferedInputStream(
new InflaterInputStream(new FileInputStream(file))));
int len = dis.readInt();
String[] keys = new String[len];
String[] vals = new String[len];
List<MyObject> list2 = new ArrayList<>(len);
for (int i = 0; i < len; i++) {
keys[i] = dis.readUTF();
}
for (int i = 0; i < len; i++) {
vals[i] = dis.readUTF();
}
for (int i = 0; i < len; i++) {
list2.add(new MyObject(keys[i], vals[i], dis.readInt()));
}
dis.close();
}
long time = System.nanoTime() - start;
long size = file.length();
System.out.printf("Compact Custom serialization uses %,d bytes and too %.3f seconds.%n",
size, time / 1e9);
}
static class MyObject implements Serializable {
private String key;
private String val;
private int num;
MyObject(String a, String b, int c) {
this.key = a;
this.val = b;
this.num = c;
}
MyObject(DataInput in) throws IOException {
key = in.readUTF();
val = in.readUTF();
num = in.readInt();
}
public void writeTo(DataOutput out) throws IOException {
out.writeUTF(key);
out.writeUTF(val);
out.writeInt(num);
}
public static void writeList(File file, List<MyObject> list) throws IOException {
DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(
new DeflaterOutputStream(new FileOutputStream(file))));
dos.writeInt(list.size());
for (MyObject mo : list) {
mo.writeTo(dos);
}
dos.close();
}
public static List<MyObject> readList(File file) throws IOException {
DataInputStream dis = new DataInputStream(new BufferedInputStream(
new InflaterInputStream(new FileInputStream(file))));
int len = dis.readInt();
List<MyObject> list = new ArrayList<>(len);
for (int i = 0; i < len; i++) {
list.add(new MyObject(dis));
}
dis.close();
return list;
}
}
}
prints finally
Java serialization uses 61,168 bytes and too 0.061 seconds.
Faster Custom serialization uses 62,519 bytes and too 0.024 seconds.
Compact Custom serialization uses 68,225 bytes and too 0.020 seconds.
As you can see my attempts to make the file more compact instead made it faster, which is a good example of why you should test performance improvements.
Consider using fast-serialization. It is source-level compatible to JDK-serialization, and creates less bloat.
Additionally it beats most of handcrafted "Externalizable" serialization, as its not only the JDK-serialization implementation itself, but also inefficient In/Output stream implementations of stock JDK which hurt performance.
http://code.google.com/p/fast-serialization/
I want to read from file from two different places concurrently. I also want to use buffered i/o stream for efficiency. I tried to work out sth on my own given java API, but it's not working. Anybody will help? I need it for external merge-sort. Thanks for help!
You need to create a RandomAccessFile, which is basically Java's equivalent of C's memory mapped file.
I found an example of this:
try {
File file = new File("filename");
// Create a read-only memory-mapped file
FileChannel roChannel = new RandomAccessFile(file, "r").getChannel();
ByteBuffer roBuf = roChannel.map(FileChannel.MapMode.READ_ONLY, 0, (int)roChannel.size());
// Create a read-write memory-mapped file
FileChannel rwChannel = new RandomAccessFile(file, "rw").getChannel();
ByteBuffer wrBuf = rwChannel.map(FileChannel.MapMode.READ_WRITE, 0, (int)rwChannel.size());
// Create a private (copy-on-write) memory-mapped file.
// Any write to this channel results in a private copy of the data.
FileChannel pvChannel = new RandomAccessFile(file, "rw").getChannel();
ByteBuffer pvBuf = roChannel.map(FileChannel.MapMode.READ_WRITE, 0, (int)rwChannel.size());
} catch (IOException e) {
}
Edit, you stated you can't use a RandomAccessFile, which is the only way to skip up and down through the file. If you're stuck without it, then you must read the file sequentially, but that doesn't mean that you can't open multiple pointers to the same file for reading.
I put together the following test/sample and it shows clearly that you can open the file "twice" with different read pointers and sequentially sum two halves of the file. Again, if you need random access, you must use a RandomAccessFile, and that's what I'd suggest, but here you go:
public class FileTest {
public static void main(String[] args) throws IOException, InterruptedException, ExecutionException{
File temp = File.createTempFile("asfd", "");
BufferedWriter wrt = new BufferedWriter(new FileWriter(temp));
int testLength = 10000;
int numWidth = String.valueOf(testLength).length();
int targetSum = 0;
for(int i = 0; i < testLength; i++){
// each line guaranteed to have a good number of characters for our test
wrt.write(String.format("%0"+ numWidth +"d\n", i));
targetSum += i;
}
wrt.close();
BufferedReader rdr1 = new BufferedReader(new FileReader(temp));
BufferedReader rdr2 = new BufferedReader(new FileReader(temp));
rdr2.skip((numWidth+1)*testLength / 2); // skip first half of the lines
Summer sum1 = new Summer(rdr1, testLength / 2);
Summer sum2 = new Summer(rdr2, testLength / 2);
ExecutorService executor = Executors.newFixedThreadPool(2);
Future<Integer> halfSum1 = executor.submit(sum1);
Future<Integer> halfSum2 = executor.submit(sum2);
System.out.println("Total sum = " + (halfSum1.get() + halfSum2.get()) + " reference " + targetSum);
rdr1.close();
rdr2.close();
temp.delete();
}
private static class Summer implements Callable<Integer>{
private BufferedReader rdr;
private int limit;
public Summer(BufferedReader rdr, int limit) throws IOException{
this.rdr = rdr;
this.limit = limit;
}
#Override
public Integer call() throws Exception {
System.out.println(Thread.currentThread().getName() + " started " + System.currentTimeMillis());
int sum = 0;
for(int i = 0; i < limit; i++){
sum += Integer.valueOf(rdr.readLine());
// uncomment to see interleaving of threads:
//System.out.println(Thread.currentThread().getName());
}
System.out.println(Thread.currentThread().getName() + " finished " + System.currentTimeMillis());
return sum;
}
}
}
What's to stop you from simply opening the file twice, and working with it as if it were two independent files?
File inputFile = new File("src/SameFileTwice.java");
BufferedReader in1 = new BufferedReader(new InputStreamReader(new FileInputStream(inputFile)));
BufferedReader in2 = new BufferedReader(new InputStreamReader(new FileInputStream(inputFile)));
try {
String strLine;
while ((strLine = in1.readLine()) != null && (strLine = in2.readLine()) != null) {
System.out.println(strLine);
}
} finally {
in1.close();
in2.close();
}