I was trying to read a file into an array by using FileInputStream, and an ~800KB file took about 3 seconds to read into memory. I then tried the same code except with the FileInputStream wrapped into a BufferedInputStream and it took about 76 milliseconds. Why is reading a file byte by byte done so much faster with a BufferedInputStream even though I'm still reading it byte by byte? Here's the code (the rest of the code is entirely irrelevant). Note that this is the "fast" code. You can just remove the BufferedInputStream if you want the "slow" code:
InputStream is = null;
try {
is = new BufferedInputStream(new FileInputStream(file));
int[] fileArr = new int[(int) file.length()];
for (int i = 0, temp = 0; (temp = is.read()) != -1; i++) {
fileArr[i] = temp;
}
That makes the BufferedInputStream version well over 30 times faster. So why is this, and is it possible to make this code more efficient (without using any external libraries)?
In FileInputStream, the method read() reads a single byte. From the source code:
/**
* Reads a byte of data from this input stream. This method blocks
* if no input is yet available.
*
* @return the next byte of data, or <code>-1</code> if the end of the
* file is reached.
* @exception IOException if an I/O error occurs.
*/
public native int read() throws IOException;
This is a native call to the OS which uses the disk to read the single byte. This is a heavy operation.
With a BufferedInputStream, the method delegates to an overloaded read() method that reads 8192 bytes at a time and buffers them until they are needed. It still returns only a single byte (but keeps the others in reserve). This way the BufferedInputStream makes far fewer native calls to the OS to read from the file.
For example, suppose your file is 32768 bytes long. To get all of its bytes into memory with a plain FileInputStream, you need 32768 native calls to the OS. With a BufferedInputStream, you need only 4 such calls, regardless of how many read() calls you make (still 32768).
As to how to make it faster, you might want to consider Java 7's NIO FileChannel class, but I have no evidence to support this.
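For what it's worth, here is a minimal sketch (mine, not part of the answer) of what a Java 7 FileChannel read might look like; the method name readWithFileChannel is made up, and whether it actually beats a BufferedInputStream for your file size would have to be measured:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public static byte[] readWithFileChannel(String path) throws IOException {
    try (FileChannel channel = FileChannel.open(Paths.get(path), StandardOpenOption.READ)) {
        ByteBuffer buffer = ByteBuffer.allocate((int) channel.size());
        // Each channel.read() pulls in a large chunk; usually one or two calls suffice.
        while (buffer.hasRemaining() && channel.read(buffer) != -1) {
            // keep filling the buffer until it is full or EOF is reached
        }
        buffer.flip();
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }
}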
Note: if you used FileInputStream's read(byte[], int, int) method directly instead, with a buffer of at least 8192 bytes, you wouldn't need a BufferedInputStream wrapping it.
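As an illustration of that note (again a sketch of mine, reusing the file variable from the question), reading straight into a byte array with read(byte[], int, int) could look like this:

try (FileInputStream in = new FileInputStream(file)) {
    byte[] data = new byte[(int) file.length()];
    int offset = 0;
    while (offset < data.length) {
        int count = in.read(data, offset, data.length - offset);
        if (count == -1) {
            break; // end of file reached earlier than expected
        }
        offset += count; // read() may return fewer bytes than requested, so keep looping
    }
    // data now holds the file contents (offset bytes of it, to be precise)
}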
A BufferedInputStream wrapped around a FileInputStream requests data from the FileInputStream in big chunks (8 KB by default). Thus if you read 1000 characters one at a time, the FileInputStream only has to go to the disk once. This is much faster!
It is because of the cost of disk access. Let's assume you have a file whose size is 8 KB. Without a BufferedInputStream, 8 * 1024 disk accesses would be needed to read this file byte by byte.
At this point, BufferedInputStream enters the scene and acts as a middleman between the FileInputStream and the file to be read.
In one shot it fetches a chunk of bytes (8 KB by default) into memory, and subsequent read() calls are served from this middleman.
This decreases the time the operation takes.
private void exercise1WithBufferedStream() {
long start= System.currentTimeMillis();
try (FileInputStream myFile = new FileInputStream("anyFile.txt")) {
BufferedInputStream bufferedInputStream = new BufferedInputStream(myFile);
boolean eof = false;
while (!eof) {
int inByteValue = bufferedInputStream.read();
if (inByteValue == -1) eof = true;
}
} catch (IOException e) {
System.out.println("Could not read the stream...");
e.printStackTrace();
}
System.out.println("time passed with buffered:" + (System.currentTimeMillis()-start));
}
private void exercise1() {
long start= System.currentTimeMillis();
try (FileInputStream myFile = new FileInputStream("anyFile.txt")) {
boolean eof = false;
while (!eof) {
int inByteValue = myFile.read();
if (inByteValue == -1) eof = true;
}
} catch (IOException e) {
System.out.println("Could not read the stream...");
e.printStackTrace();
}
System.out.println("time passed without buffered:" + (System.currentTimeMillis()-start));
}
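To compare the two, one could call both methods from a small main; the enclosing class name ReadTimingDemo below is made up:

public static void main(String[] args) {
    ReadTimingDemo demo = new ReadTimingDemo(); // hypothetical class containing the two methods above
    demo.exercise1();                           // unbuffered: one native read per byte
    demo.exercise1WithBufferedStream();         // buffered: one native read per 8 KB chunk
}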
Related
I wrote a piece of Java code to send PostScript files (converted from PDF) to a network printer via a Socket.
The files are printed in perfect shape, but every job comes with one or two extra pages containing text like ps: stack underflow or error: undefined, offending command.
At the beginning I thought something was wrong with the PDF-to-PS conversion, so I tried two PS files from this PS Files collection, but the problem is still there.
I also verified the PS files with GhostView. Now I think there may be something wrong with the code. The code does not throw any exception.
The printer, a Toshiba e-STUDIO 5005AC, supports PS3 and PCL6.
File file = new File("/path/to/my.ps");
Socket socket = null;
DataOutputStream out = null;
FileInputStream inputStream = null;
try {
socket = new Socket(printerIP, printerPort);
out = new DataOutputStream(socket.getOutputStream());
DataInputStream input = new DataInputStream(socket.getInputStream());
inputStream = new FileInputStream(file);
byte[] buffer = new byte[8000];
while (inputStream.read(buffer) != -1) {
out.write(buffer);
}
out.flush();
} catch (IOException e) {
e.printStackTrace();
}
You are writing the whole buffer to the output stream regardless of how much actual content there is.
That means that when you write the buffer for the last time, it will most probably still contain leftover content from the previous iteration at its end.
Example
e.g. imagine you have the following file and you use a buffer of size 10:
1234567890ABCDEF
After the first inputStream.read() call it returns 10, and the buffer contains:
1234567890
After the second inputStream.read() call it returns 6, and the buffer contains:
ABCDEF7890
After the third inputStream.read() call it returns -1 and you stop reading.
The printer socket receives this data in the end:
1234567890ABCDEF7890
Here the trailing 7890 is extra data that the printer does not understand, even though it can successfully interpret the initial 1234567890ABCDEF.
Fix
You should consider the length returned by inputStream.read():
byte[] buffer = new byte[8000];
for (int length; (length = inputStream.read(buffer)) != -1; ){
out.write(buffer, 0, length);
}
Also consider using try-with-resources to avoid problems with unclosed streams.
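Putting the two suggestions together, a sketch of the sending loop with try-with-resources might look like this (printerIP and printerPort as in the question; the DataOutputStream is dropped here since a plain OutputStream is enough):

File file = new File("/path/to/my.ps");
try (Socket socket = new Socket(printerIP, printerPort);
     OutputStream out = socket.getOutputStream();
     FileInputStream inputStream = new FileInputStream(file)) {
    byte[] buffer = new byte[8000];
    for (int length; (length = inputStream.read(buffer)) != -1; ) {
        out.write(buffer, 0, length); // write only the bytes that were actually read
    }
    out.flush();
} catch (IOException e) {
    e.printStackTrace();
}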
I'm trying to serialize objects between an NIO SocketChannel and a blocking IO Socket. Since I can't use Serializable/writeObject directly with NIO, I decided to serialize the object into a ByteArrayOutputStream and then send the array length followed by the array itself.
The sender function is:
public void writeObject(Object obj) throws IOException{
ByteArrayOutputStream serializedObj = new ByteArrayOutputStream();
ObjectOutputStream writer = new ObjectOutputStream(serializedObj);
writer.writeUnshared(obj);
ByteBuffer size = ByteBuffer.allocate(4).putInt(serializedObj.toByteArray().length);
this.getSocket().write(size);
this.getSocket().write(ByteBuffer.wrap(serializedObj.toByteArray()));
}
and the receiver is:
public Object readObject(){
try {
// Read the total packet size
byte[] dimension = new byte[4];
int byteRead = 0;
while(byteRead < 4) {
byteRead += this.getInputStream().read(dimension, byteRead, 4 - byteRead);
}
int size = ByteBuffer.wrap(dimension).getInt(); /* (*) */
System.out.println(size);
byte[] object = new byte[size];
while(size > 0){
size -= this.getInputStream().read(object);
}
InputStream in = new ByteArrayInputStream(object, 0, object.length);
ObjectInputStream ois = new ObjectInputStream(in);
Object res = ois.readUnshared();
ois.close();
return res;
} catch (IOException | ClassNotFoundException e) {
return null;
}
}
The problem is that size (*) always equals -1393754107, while serializedObj.toByteArray().length in my test is 316.
I don't understand why the conversion doesn't work properly.
this.getSocket().write(size);
this.getSocket().write(ByteBuffer.wrap(serializedObj.toByteArray()));
If the result of getSocket() is a SocketChannel in non-blocking mode, the problem is here. You aren't checking the result of write(). In non-blocking mode it can write less than the number of bytes remaining in the ByteBuffer; indeed it can write zero bytes.
So you aren't writing all the data you think you're writing, so the other end over-runs: it reads the next length word as part of the data, and part of the next data as the next length word, and gets a wrong answer. I'm surprised it didn't barf earlier. In fact it probably did, but your deplorable practice of ignoring IOExceptions masked it. Don't do that. Log them.
So you need to loop until all the requested data has been written, and if any write() returns zero you need to select on OP_WRITE until it fires. That adds a considerable complication to your code: you have to return to the select loop while remembering that there is an outstanding ByteBuffer with data remaining to be written, and when OP_WRITE fires and the writes complete you have to deregister interest in OP_WRITE, as it is only of interest after a write() has returned zero.
NB There is no casting in your code.
The problem was that write() always returned 0. This happened because the buffer wasn't flipped before write().
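A sketch of a corrected sender, assuming getSocket() returns a blocking SocketChannel, that flips the buffer and loops until everything has been written:

public void writeObject(Object obj) throws IOException {
    ByteArrayOutputStream serializedObj = new ByteArrayOutputStream();
    ObjectOutputStream writer = new ObjectOutputStream(serializedObj);
    writer.writeUnshared(obj);
    writer.flush();

    byte[] payload = serializedObj.toByteArray();
    ByteBuffer buffer = ByteBuffer.allocate(4 + payload.length);
    buffer.putInt(payload.length); // length prefix
    buffer.put(payload);           // serialized object
    buffer.flip();                 // switch the buffer from filling to draining

    while (buffer.hasRemaining()) { // even a blocking channel may write in several chunks
        this.getSocket().write(buffer);
    }
}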
According to:
Note that while some implementations of InputStream will return the
total number of bytes in the stream, many will not. It is never
correct to use the return value of this method to allocate a buffer
intended to hold all data in this stream.
from:
http://docs.oracle.com/javase/7/docs/api/java/io/InputStream.html#available%28%29
and this note
In particular, code of the form
int n = in.available();
byte[] buf = new byte[n];
in.read(buf);
is not guaranteed to read all of the remaining bytes from the given input stream.
http://docs.oracle.com/javase/8/docs/technotes/guides/io/troubleshooting.html
Does this mean that using the function below can cause the file not to be read completely?
/**
* Reads a file from /raw/res/ and returns it as a byte array
* @param res Resources instance for Mosembro
* @param resourceId ID of resource (ex: R.raw.resource_name)
* @return byte[] if successful, null otherwise
*/
public static byte[] readRawByteArray(Resources res, int resourceId)
{
InputStream is = null;
byte[] raw = new byte[] {};
try {
is = res.openRawResource(resourceId);
raw = new byte[is.available()];
is.read(raw);
}
catch (IOException e) {
e.printStackTrace();
raw = null;
}
finally {
try {
is.close();
}
catch (IOException e) {
e.printStackTrace();
}
}
return raw;
}
available() returns the number of bytes that can be read without blocking. There is no necessary correlation between that number, which can be zero, and the total length of the file.
Yes, it does not necessarily read everything. It is like RandomAccessFile.read(byte[]) as opposed to RandomAccessFile.readFully(byte[]). Furthermore, if available() happens to return 0 at that point, the code physically reads 0 bytes.
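For contrast (my sketch, not part of either answer), a readFully-style loop for an InputStream, which does not depend on available() at all:

static void readFully(InputStream in, byte[] buf) throws IOException {
    int offset = 0;
    while (offset < buf.length) {
        int count = in.read(buf, offset, buf.length - offset);
        if (count == -1) {
            throw new EOFException("stream ended after " + offset + " bytes");
        }
        offset += count; // read() may return fewer bytes than requested
    }
}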
It probably reads only the first block if the source is a slow device such as a file system.
The principle:
The file is read by the underlying system software, normally buffered, so a couple of blocks are already in memory, and sometimes the system is already reading further ahead. The software reads blocks asynchronously, and blocks the caller when it tries to read more than the system has already read.
So in general the software has a read loop over blocks, and a read operation regularly blocks until the physical read has buffered enough data.
To hope for non-blocking reads, you would need to do something like:
InputStream is = res.openRawResource(resourceId);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
for (;;) {
// Read bytes until no longer available:
for (;;) {
int n = is.available();
if (n == 0) {
break;
}
byte[] part = new byte[n];
int nread = is.read(part);
assert nread == n;
baos.write(part, 0, nread);
}
// Still a probably blocking read:
byte[] part = new byte[128];
int nread = is.read(part);
if (nread <= 0) {
break; // End of file
}
baos.write(part, 0, nread);
}
return baos.toByteArray();
Now, before you copy that code: simply use a plain blocking read loop instead. I cannot see an advantage in using available() unless you can do something useful with partial data while the rest is still being read.
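Such a blocking read loop, using the same res and resourceId as above, could be as simple as:

InputStream is = res.openRawResource(resourceId);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
byte[] part = new byte[8192];
for (int nread; (nread = is.read(part)) != -1; ) {
    baos.write(part, 0, nread); // append exactly the number of bytes that were read
}
is.close();
return baos.toByteArray();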
I have to read a 53 MB file character by character. When I do it in C++ using ifstream, it completes in milliseconds, but using Java InputStream it takes several minutes. Is it normal for Java to be this slow, or am I missing something?
Also, I need to complete the program in Java (it uses servlets, from which I have to call the functions that process these characters). I was thinking of writing the file-processing part in C or C++ and then using the Java Native Interface to call those functions from my Java program... what do you think of this idea?
Can anyone give me any other tips? I seriously need to read the file faster. I tried using buffered input, but it still doesn't come close to C++ performance.
Edited: my code spans several files and is very messy, so I am giving a synopsis:
import java.io.*;
public class tmp {
public static void main(String args[]) {
try{
InputStream file = new BufferedInputStream(new FileInputStream("1.2.fasta"));
char ch;
while(file.available()!=0) {
ch = (char)file.read();
/* Do processing */
}
System.out.println("DONE");
file.close();
}catch(Exception e){}
}
}
I ran this code with a 183 MB file. It printed "Elapsed 250 ms".
final InputStream in = new BufferedInputStream(new FileInputStream("file.txt"));
final long start = System.currentTimeMillis();
int cnt = 0;
final byte[] buf = new byte[1000];
while (in.read(buf) != -1) cnt++;
in.close();
System.out.println("Elapsed " + (System.currentTimeMillis() - start) + " ms");
I would try this
// create the file so we have something to read.
final String fileName = "1.2.fasta";
FileOutputStream fos = new FileOutputStream(fileName);
fos.write(new byte[54 * 1024 * 1024]);
fos.close();
// read the file in one hit.
long start = System.nanoTime();
FileChannel fc = new FileInputStream(fileName).getChannel();
ByteBuffer bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
while (bb.remaining() > 0)
bb.getLong();
long time = System.nanoTime() - start;
System.out.printf("Took %.3f seconds to read %.1f MB%n", time / 1e9, fc.size() / 1e6);
fc.close();
((DirectBuffer) bb).cleaner().clean();
prints
Took 0.016 seconds to read 56.6 MB
Use a BufferedInputStream:
InputStream buffy = new BufferedInputStream(inputStream);
As noted above, use a BufferedInputStream. You could also use the NIO package. Note that for most files, a BufferedInputStream will read just as fast as NIO. However, for extremely large files NIO may do better, because you can use memory-mapped file operations. Furthermore, the NIO package supports interruptible I/O, whereas the java.io package does not. That means that if you want to be able to cancel the operation from another thread, you have to use NIO to make it reliable.
ByteBuffer buf = ByteBuffer.allocate(BUF_SIZE);
FileChannel fileChannel = fileInputStream.getChannel();
int readCount = 0;
while ( (readCount = fileChannel.read(buf)) > 0) {
buf.flip();
while (buf.hasRemaining()) {
byte b = buf.get();
}
buf.clear();
}
Here is how I compressed the string into a file:
public static void compressRawText(File outFile, String src) {
FileOutputStream fo = null;
GZIPOutputStream gz = null;
try {
fo = new FileOutputStream(outFile);
gz = new GZIPOutputStream(fo);
gz.write(src.getBytes());
gz.flush();
} catch (Exception e) {
e.printStackTrace();
} finally {
try {
gz.close();
fo.close();
} catch (Exception e) {
e.printStackTrace();
}
}
}
Here is how I decompressed it:
static int BUFFER_SIZE = 8 * 1024;
static int STRING_SIZE = 2 * 1024 * 1024;
public static String decompressRawText(File inFile) {
InputStream in = null;
InputStreamReader isr = null;
StringBuilder sb = new StringBuilder(STRING_SIZE);//constant resizing is costly, so set the STRING_SIZE
try {
in = new FileInputStream(inFile);
in = new BufferedInputStream(in, BUFFER_SIZE);
in = new GZIPInputStream(in, BUFFER_SIZE);
isr = new InputStreamReader(in);
char[] cbuf = new char[BUFFER_SIZE];
int length = 0;
while ((length = isr.read(cbuf)) != -1) {
sb.append(cbuf, 0, length);
}
} catch (Exception e) {
e.printStackTrace();
} finally {
try {
in.close();
} catch (Exception e1) {
e1.printStackTrace();
}
}
return sb.toString();
}
The decompression seems to take forever. I have a feeling that I am doing too many redundant steps in the decompression part. Any idea how I could speed it up?
EDIT: I have modified the code to the above based on the following recommendations:
1. I changed the pattern to simplify my code a bit, but if I can't use IOUtils, is it still OK to use this pattern?
2. I set the StringBuilder buffer to 2 MB, as suggested by entonio; should I set it a little higher? Memory is still OK: I still have around 10 MB available according to the heap monitor in Eclipse.
3. I dropped the BufferedReader and added a BufferedInputStream, but I am still not sure about the BUFFER_SIZE. Any suggestions?
The above modifications have brought the time to loop over all thirty of my 2 MB files down from almost 30 seconds to around 14, but I need to get it under 10. Is that even possible on Android? Basically, I need to process 60 MB of text in total; I have divided it into thirty 2 MB files, and before I start processing each string, I measured how long it takes just to loop over all the files and get each file's String into memory. Since I don't have much experience, would it be better to use sixty 1 MB files instead, or is there any other improvement I should adopt? Thanks.
ALSO: Since physical IO is quite time-consuming, and since the compressed versions of my files are all quite small (around 2 KB from 2 MB of text), is it possible to do the above on a file that is already mapped to memory, possibly using Java NIO? Thanks.
The BufferedReader's only purpose is the readLine() method, which you don't use, so why not just read from the InputStreamReader? Also, decreasing the buffer size might be helpful. Also, you should probably specify the encoding both when reading and when writing, though that shouldn't have an impact on performance.
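For example (a sketch; UTF-8 is only an assumption about how the text was produced), specifying the encoding on both sides would look like:

// when compressing (in compressRawText)
gz.write(src.getBytes(java.nio.charset.Charset.forName("UTF-8")));
// when decompressing (in decompressRawText)
isr = new InputStreamReader(in, java.nio.charset.Charset.forName("UTF-8"));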
edit: more data
If you know the size of the string ahead of time, you should add a length parameter to decompressRawText and use it to initialise the StringBuilder. Otherwise it will be constantly resized in order to accommodate the result, and that's costly.
edit: clarification
2MB implies a lot of resizes. There is no harm if you specify a capacity higher than the length you end up with after reading (other than temporarily using more memory, of course).
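A tiny illustration of that suggestion (the expectedLength parameter is made up):

public static String decompressRawText(File inFile, int expectedLength) {
    // Pre-sizing avoids the repeated internal array copies that growing a StringBuilder causes.
    StringBuilder sb = new StringBuilder(expectedLength);
    // ... same reading loop as in the question ...
    return sb.toString();
}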
You should wrap the FileInputStream with a BufferedInputStream before wrapping it with a GZIPInputStream, rather than using a BufferedReader.
The reason is that, depending on implementation, any of the various input classes in your decoration hierarchy could decide to read on a byte-by-byte basis (and I'd say the InputStreamReader is most likely to do this). And that would translate into many read(2) calls once it gets to the FileInputStream.
Of course, this may just be superstition on my part. But, if you're running on Linux, you can always test with strace.
Edit: one nice pattern to follow when building up a chain of stream delegates is to use a single InputStream variable. Then you only have one thing to close in your finally block (and you can use Jakarta Commons IOUtils to avoid lots of nested try-catch-finally blocks).
InputStream in = null;
try
{
in = new FileInputStream("foo");
in = new BufferedInputStream(in);
in = new GZIPInputStream(in);
// do something with the stream
}
finally
{
IOUtils.closeQuietly(in);
}
Add a BufferedInputStream between the FileInputStream and the GZIPInputStream.
Similarly when writing.
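A sketch of the writing side with that extra buffer in place, reusing the BUFFER_SIZE constant from the question, might look like:

public static void compressRawText(File outFile, String src) throws IOException {
    // BufferedOutputStream sits between the file and the gzip stream, mirroring the read side.
    OutputStream out = new GZIPOutputStream(
            new BufferedOutputStream(new FileOutputStream(outFile), BUFFER_SIZE), BUFFER_SIZE);
    try {
        out.write(src.getBytes());
    } finally {
        out.close(); // finishes the gzip stream and closes the whole chain
    }
}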