DataOutputStream() VS DataOutputStream(new BufferedOutputStream()) - java

The code at the Java Tutorials shows an example of using the DataOutputStream and DataInputStream classes.
A snippet of the code looks like this:
//..
out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(dataFile)));
//..
in = new DataInputStream(new BufferedInputStream(new FileInputStream(dataFile)));
//..
I was wondering why it is required to create a new BufferedOutputStream when we create a new DataOutputStream.
Isn't it redundant, since this alternative works as well: new DataOutputStream(new FileOutputStream(dataFile));
As this page claims, DataStreams already provides a buffered file output byte stream. So is "double-buffering" really required?
I've modified the two lines of code (output and input), taking away the BufferedOutputStream and BufferedInputStream, and everything seems to work just fine. So what is the purpose of the BufferedOutputStream and BufferedInputStream?

Wrapping the FileOutputStream in a BufferedOutputStream will generally speed up the overall output of your program. This will only be noticeable if you are writing large amounts of data. The same thing goes for wrapping an InputStream in a BufferedInputStream. The use of buffers will only affect efficiency, not correctness.
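To see the effect yourself, here is a minimal timing sketch (the file name data.bin and the iteration count are arbitrary placeholders, not from the original question):

import java.io.*;

public class BufferTimingDemo {
    public static void main(String[] args) throws IOException {
        File dataFile = new File("data.bin");   // placeholder file name

        // Unbuffered: every writeInt() goes straight to the OS.
        long start = System.nanoTime();
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(dataFile))) {
            for (int i = 0; i < 1_000_000; i++) out.writeInt(i);
        }
        System.out.println("unbuffered: " + (System.nanoTime() - start) / 1_000_000 + " ms");

        // Buffered: writes accumulate in an 8 KB buffer and reach the OS in large chunks.
        start = System.nanoTime();
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(dataFile)))) {
            for (int i = 0; i < 1_000_000; i++) out.writeInt(i);
        }
        System.out.println("buffered:   " + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}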

It's not redundant, it's just different. The Buffered variants add a buffering layer, speeding up IO operations by batching up reads and writes.
Instead of going to disk for every read/write, it goes to memory first. How much of a difference it makes depends on a variety of factors. The OS and/or disk I/O system also likely does some buffering.

I used to think that the Java IO model was unnecessarily large, but now that I really "get it" I find it quite elegant. A BufferedOutputStream is an implementation of the Decorator pattern (google it... it's useful). What this means is that BufferedOutputStream simply adds functionality to the OutputStream it wraps. Internally, the BufferedOutputStream calls whatever OutputStream it decorates.
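To make the pattern concrete, here is a minimal decorator in the same spirit (CountingOutputStream is a made-up name for illustration): it wraps any OutputStream, adds one small piece of functionality, and delegates the actual writing to the wrapped stream.

import java.io.*;

class CountingOutputStream extends FilterOutputStream {
    private long count;

    CountingOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(int b) throws IOException {
        count++;        // the added functionality
        out.write(b);   // delegate to the decorated stream
    }

    public long getCount() {
        return count;
    }
}

It chains exactly like the built-in decorators: new DataOutputStream(new CountingOutputStream(new FileOutputStream(dataFile))).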

Buffered IO streams help you read in bulk, thereby reducing the IO cost significantly. IO operations are fairly costly. Imagine your application doing a full read/write cycle for every byte that is read/written, as opposed to reading/writing a chunk of data in one go. Doing a buffered read/write is definitely more efficient; you will notice a huge difference if you gather performance statistics for both cases, i.e. with and without buffered IO, especially when reading/writing a large amount of data.

Related

java : Does using buffered input stream make the conversion of input streams to byte[] more efficient?

I want to convert an input stream to byte[] and I'm using IOUtils.toByteArray(inputStream). Will it be more efficient to use a wrapper like BufferedInputStream for the inputStream? Does it save memory?
Will it be more efficient to use a wrapper like BufferedInputStream for the inputStream?
Not by any significance. IOUtils.toByteArray reads data into a 4096-byte buffer. BufferedInputStream uses an 8192-byte buffer by default.
Using BufferedInputStream does fewer IO reads, but you need a very fast data source to notice any difference.
If you read an InputStream one byte at a time (or a few bytes at a time), then using a BufferedInputStream really improves performance, because it reduces the number of operating-system calls by a factor of roughly 8000, and operating-system calls take a lot of time, comparatively.
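As a sketch of the two cases (the file name is a placeholder):

// One OS call per byte read:
try (InputStream in = new FileInputStream("data.bin")) {
    int b;
    while ((b = in.read()) != -1) { /* process b */ }
}

// Roughly one OS call per 8192 bytes; most read() calls are served from the internal buffer:
try (InputStream in = new BufferedInputStream(new FileInputStream("data.bin"))) {
    int b;
    while ((b = in.read()) != -1) { /* process b */ }
}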
Does it save memory ?
No. IOUtils.toByteArray will create a new byte[4096] regardless of whether you pass in a buffered or an unbuffered InputStream. A BufferedInputStream costs a bit more memory to create, but nothing significant.
In terms of final memory consumption it wouldn't help: you will need to move the whole stream into the byte[] anyway, so the size of the array, and hence the memory consumption, would be the same.
What BufferedInputStream does is wrap another stream: instead of hitting the underlying stream for every read, it fills an internal buffer in large chunks and serves your reads from that buffer. That makes the read operations themselves faster, because the expensive calls happen in batches, but it doesn't change how much data ends up on the other side.
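For reference, here is a sketch of what a toByteArray-style copy does internally (the general shape only, not the actual Commons IO source):

static byte[] toByteArray(InputStream in) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[4096];   // temporary copy buffer
    int n;
    while ((n = in.read(buffer)) != -1) {
        out.write(buffer, 0, n);
    }
    // The final array must hold the whole stream either way,
    // so buffering the input cannot reduce this allocation.
    return out.toByteArray();
}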

Transfer a File over a network using TCP (Speed up the transfer)

I have been trying to send a big file over a Socket connection, but it runs slowly and I was wondering if this code can be optimized in some way to improve the transfer speed.
This is my code for sending the file:
byte[] buffer = new byte[65536];
int number;
while ((number = fileInputStream.read(buffer)) != -1) {
socketOutputStream.write(buffer, 0, number);
}
socketOutputStream.close();
fileInputStream.close();
This is what I use to receive the file on the other machine:
byte[] buffer = new byte[65536];
int number;
InputStream socketStream = clientSocket.getInputStream();
File f = new File("C:\\output.dat");
OutputStream fileStream = new FileOutputStream(f);
while ((number = socketStream.read(buffer)) != -1) {
fileStream.write(buffer, 0, number);
}
fileStream.close();
socketStream.close();
I think writing to the fileStream is taking the majority of the time. Could anyone offer any advice on speeding up this code?
There's nothing obviously wrong with that code, other than the lack of finally blocks for the close statements.
How long does it take for how much data? It's very unlikely that the FileOutputStream is what's taking the time - it's much more likely to be the network being slow. You could potentially read from the network and write to the file system in parallel, but that would be a lot of work to get right, and it's unlikely to give that much benefit, IMO.
You could try a BufferedOutputStream around the FileOutputStream. It would have the effect of block-aligning all disk writes, regardless of the count you read from the network. I wouldn't expect a major difference but it might help a bit.
I had a similar issue FTP'ing large files. I realized that using the same buffer for reading from the hard drive AND writing to the network was the issue. File system IO likes larger buffers because it is a lot less work for the hard drive to do all the seeking and reading. Networks, on the other hand, tend to prefer smaller buffers for optimizing throughput.
The solution is to read from the hard disk using a large buffer, then write this buffer to the network stream in smaller chunks.
I was able to max out my NIC at 100% utilization for the entire length of any file with 4mb reads and 32kb writes. You can then do the mirrored version on the server by reading in 32kb at a time and storing it in memory then writing 4mb at a time to the hard drive.
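A sketch of that large-read/small-write loop, reusing the variable names from the question (the 4 MB and 32 KB sizes are the ones reported above, not universal constants; tune them for your hardware):

byte[] buffer = new byte[4 * 1024 * 1024];   // 4 MB disk reads
final int CHUNK = 32 * 1024;                 // 32 KB network writes
int number;
while ((number = fileInputStream.read(buffer)) != -1) {
    for (int off = 0; off < number; off += CHUNK) {
        socketOutputStream.write(buffer, off, Math.min(CHUNK, number - off));
    }
}
socketOutputStream.flush();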

Is FileInputStream using buffers already?

When I am using FileInputStream to read an object (say a few bytes), does the underlying operation involve:
1) Reading a whole block of disk, so that if I subsequently do another read operation, it wouldn't require a real disk read, as that portion of the file was already fetched in the last read operation?
OR
2) A new disk access to take place because FileInputStream does not do any buffering and bufferedInputStream should have been used instead to achieve the effect of (1)?
I think that since FileInputStream uses the read system call, and that reads only a set of pages from the hard disk, some buffering must take place.
FileInputStream will make an underlying native system call. Most OSes will do their own buffering for this. So it does not need a real disk seek for each byte. But still, you have the cost of making the native OS call, which is expensive. So BufferedStream would be preferable. However, for reading small amounts of data (like you say, a few bytes or even kBs), either one should be fine as the number of OS calls won't be that different.
Looking at the native code for FileInputStream, it doesn't look like there is any buffering going on in there. The OS buffering may kick in, but there's no explicit indicator one way or another of if/when that happens.
One thing to look out for is reading from a mounted network volume over a slow connection. I ran into a big performance issue using a non-buffered FileInputStream for this. Didn't catch it in development, because the file system was local.

What about buffering FileInputStream?

I have a piece of code that reads a hell of a lot (hundreds of thousands) of relatively small files (a couple of KB each) from the local file system in a loop. For each file a java.io.FileInputStream is created to read the content. The process is very slow and takes ages.
Do you think that wrapping the FIS into java.io.BufferedInputStream would make a significant difference?
If you aren't already using a byte[] buffer of a decent size in the read/write loop (the latest implementation of BufferedInputStream uses 8KB), then it will certainly make a difference. Give it a try yourself. Don't forget to make any OutputStream a BufferedOutputStream as well.
But if you have already buffered it using a byte[], and/or it makes only a little difference after all, then you've hit the hard disk and I/O controller speed as the bottleneck.
I very much doubt whether that will make any difference.
Your fundamental problem is the hundreds of thousands of tiny files. Reading those is going to make the disk thrash and take forever no matter how you do it; you'll spend 99.9% of the time waiting on mechanical movement inside the hard disk.
There are two ways to fix this:
Save your data on an SSD - they have much lower (as in five orders of magnitude less) latency.
Rearrange your data into a few large files and read those sequentially (see the sketch below)
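A sketch of the second option using java.nio (directory and file names are placeholders; a real format would also need to record each entry's name and length so the big file can be split apart again):

import java.io.*;
import java.nio.file.*;

try (OutputStream out = new BufferedOutputStream(
        Files.newOutputStream(Paths.get("combined.dat")))) {
    try (DirectoryStream<Path> dir = Files.newDirectoryStream(Paths.get("smallfiles"))) {
        for (Path p : dir) {
            Files.copy(p, out);   // append this file's bytes to the combined stream
        }
    }
}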
That depends on how you're reading the data. If you're reading from the FileInputStream in a very inefficient way (for example, calling read() byte-by-byte), then using a BufferedInputStream could improve things dramatically. But if you're already using a reasonable-sized buffer with FileInputStream, switching to a BufferedInputStream won't matter.
Since you're talking a large number of very small files, there's a strong possibility that a lot of the delay is due to directory operations (open, close), not the actual reading of bytes from the files.

What order should I use GzipOutputStream and BufferedOutputStream

Can anyone recommend whether I should do something like:
os = new GzipOutputStream(new BufferedOutputStream(...));
or
os = new BufferedOutputStream(new GzipOutputStream(...));
Which is more efficient? Should I use BufferedOutputStream at all?
GZIPOutputStream already comes with a built-in buffer. So, there is no need to put a BufferedOutputStream right next to it in the chain. gojomo's excellent answer already provides some guidance on where to place the buffer.
The default buffer size for GZIPOutputStream is only 512 bytes, so you will want to increase it to 8K or even 64K via the constructor parameter. The default buffer size for BufferedOutputStream is 8K, which is why you can measure an advantage when combining the default GZIPOutputStream and BufferedOutputStream. That advantage can also be achieved by properly sizing the GZIPOutputStream's built-in buffer.
So, to answer your question: "Should I use BufferedOutputStream at all?" → No, in your case, you should not use it, but instead set the GZIPOutputStream's buffer to at least 8K.
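For example (the file name is a placeholder):

// No BufferedOutputStream needed; size the GZIP buffer directly
// via the (OutputStream, int) constructor:
OutputStream os = new GZIPOutputStream(new FileOutputStream("out.gz"), 64 * 1024);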
For object streams, I found that wrapping the buffered stream around the gzip stream, for both input and output, was almost always significantly faster. The smaller the objects, the more this helped. It was better than or the same as no buffered stream in all cases.
ois = new ObjectInputStream(new BufferedInputStream(new GZIPInputStream(fis)));
oos = new ObjectOutputStream(new BufferedOutputStream(new GZIPOutputStream(fos)));
However, for text and straight byte streams, I found that it was a toss-up, with the gzip stream around the buffered stream being only slightly better. But it was better than no buffered stream in all cases.
reader = new InputStreamReader(new GZIPInputStream(new BufferedInputStream(fis)));
writer = new OutputStreamWriter(new GZIPOutputStream(new BufferedOutputStream(fos)));
I ran each version 20 times and cut off the first run and averaged the rest. I also tried buffered-gzip-buffered which was slightly better for objects and worse for text. I did not play with buffer sizes at all.
For the object streams, I tested 2 serialized object files in the 10s of megabytes. For the larger file (38mb), it was 85% faster on reading (0.7 versus 5.6 seconds) but actually slightly slower for writing (5.9 versus 5.7 seconds). These objects had some large arrays in them which may have meant larger writes.
method crc date time compressed uncompressed ratio
defla eb338650 May 19 16:59 14027543 38366001 63.4%
For the smaller file (18mb), it was 75% faster for reading (1.6 versus 6.1 seconds) and 40% faster for writing (2.8 versus 4.7 seconds). It contained a large number of small objects.
method crc date time compressed uncompressed ratio
defla 92c9d529 May 19 16:56 6676006 17890857 62.7%
For the text reader/writer I used a 64mb csv text file. The gzip stream around the buffered stream was 11% faster for reading (950 versus 1070 milliseconds) and slightly faster when writing (7.9 versus 8.1 seconds).
method crc date time compressed uncompressed ratio
defla c6b72e34 May 20 09:16 22560860 63465800 64.5%
The buffering helps when the ultimate destination of the data is best read/written in larger chunks than your code would otherwise push it. So you generally want the buffering as close as possible to the place that wants larger chunks. In your examples, that's the elided "...", so wrap the BufferedOutputStream with the GzipOutputStream. And tune the BufferedOutputStream buffer size to match what testing shows works best with the destination.
I doubt the BufferedOutputStream on the outside would help much, if at all, over no explicit buffering. Why not? The GzipOutputStream will do its write()s to "..." in the same-sized chunks whether the outside buffering is present or not. So there's no optimizing for "..." possible; you're stuck with what sizes GzipOutputStream write()s.
Note also that you're using memory more efficiently by buffering the compressed data rather than the uncompressed data. If your data often achieves 6X compression, the 'inside' buffer is equivalent to an 'outside' buffer 6X as big.
Normally you want a buffer close to your FileOutputStream (assuming that's what ... represents) to avoid too many calls into the OS and frequent disk access. However, if you're writing a lot of small chunks to the GZIPOutputStream, you might benefit from a buffer around the GZIPOutputStream as well. The reason is that the write method in GZIPOutputStream is synchronized, and it also leads to a few other synchronized calls and a couple of native (JNI) calls (to update the CRC32 and do the actual compression). These all add extra overhead per call. So in that case I'd say you'll benefit from both buffers.
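In code, that double-buffered chain looks like this (assuming ... is a FileOutputStream; the file name is a placeholder):

OutputStream os = new BufferedOutputStream(       // batches small writes into GZIP
        new GZIPOutputStream(
                new BufferedOutputStream(          // batches compressed output to disk
                        new FileOutputStream("out.gz"))));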
I suggest you try a simple benchmark to time how long it take to compress a large file and see if it makes much difference. GzipOutputStream does have buffering but it is a smaller buffer. I would do the first with a 64K buffer, but you might find that doing both is better.
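A minimal benchmark sketch along those lines (bigfile.dat is a placeholder; run each variant several times and discard the first, warm-up, run before comparing):

import java.io.*;
import java.util.zip.*;

public class GzipBench {
    public static void main(String[] args) throws IOException {
        byte[] buf = new byte[64 * 1024];   // 64K copy buffer, as suggested above
        long start = System.nanoTime();
        try (InputStream in = new FileInputStream("bigfile.dat");
             OutputStream out = new GZIPOutputStream(
                     new FileOutputStream("bigfile.dat.gz"), 64 * 1024)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        System.out.println((System.nanoTime() - start) / 1_000_000 + " ms");
    }
}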
Read the javadoc, and you will discover that BufferedInputStream is used to buffer bytes read from some original source. Once you have the raw bytes, you want to decompress them, so you wrap the BufferedInputStream with a GZIPInputStream. It makes no sense to buffer the output of the GZIP stream itself; the buffering belongs between the original source and the GZIP stream.
new GZIPInputStream(new BufferedInputStream(new FileInputStream(...)))
