What order should I use GzipOutputStream and BufferedOutputStream - java

Can anyone recommend whether I should do something like:
os = new GzipOutputStream(new BufferedOutputStream(...));
or
os = new BufferedOutputStream(new GzipOutputStream(...));
Which is more efficient? Should I use BufferedOutputStream at all?

GZIPOutputStream already comes with a built-in buffer. So, there is no need to put a BufferedOutputStream right next to it in the chain. gojomo's excellent answer already provides some guidance on where to place the buffer.
The default buffer size for GZIPOutputStream is only 512 bytes, so you will want to increase it to 8K or even 64K via the constructor parameter. The default buffer size for BufferedOutputStream is 8K, which is why you can measure an advantage when combining the default GZIPOutputStream and BufferedOutputStream. That advantage can also be achieved by properly sizing the GZIPOutputStream's built-in buffer.
So, to answer your question: "Should I use BufferedOutputStream at all?" → No, in your case, you should not use it, but instead set the GZIPOutputStream's buffer to at least 8K.
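A minimal sketch of that advice (the helper class and test data are illustrative): the second constructor argument sets GZIPOutputStream's internal buffer size, replacing the 512-byte default.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class SizedGzip {
    // Compresses data with a 64K internal buffer instead of the 512-byte default.
    public static byte[] compress(byte[] data) throws IOException {
        ByteArrayOutputStream dest = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(dest, 64 * 1024)) {
            gz.write(data);
        }
        return dest.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] out = compress(new byte[100_000]);
        System.out.println("compressed 100000 bytes down to " + out.length);
    }
}
```

No extra BufferedOutputStream is needed in this chain; the sized internal buffer does the same job.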

What order should I use GzipOutputStream and BufferedOutputStream
For object streams, I found that wrapping the buffered stream around the gzip stream for both input and output was almost always significantly faster. The smaller the objects, the better it did. In all cases it was better than, or equal to, using no buffered stream.
ois = new ObjectInputStream(new BufferedInputStream(new GZIPInputStream(fis)));
oos = new ObjectOutputStream(new BufferedOutputStream(new GZIPOutputStream(fos)));
However, for text and straight byte streams, I found that it was a toss-up, with the gzip stream around the buffered stream being only slightly better. But in all cases it was better than no buffered stream.
reader = new InputStreamReader(new GZIPInputStream(new BufferedInputStream(fis)));
writer = new OutputStreamWriter(new GZIPOutputStream(new BufferedOutputStream(fos)));
I ran each version 20 times and cut off the first run and averaged the rest. I also tried buffered-gzip-buffered which was slightly better for objects and worse for text. I did not play with buffer sizes at all.
For the object streams, I tested 2 serialized object files in the 10s of megabytes. For the larger file (38mb), it was 85% faster on reading (0.7 versus 5.6 seconds) but actually slightly slower for writing (5.9 versus 5.7 seconds). These objects had some large arrays in them which may have meant larger writes.
method crc date time compressed uncompressed ratio
defla eb338650 May 19 16:59 14027543 38366001 63.4%
For the smaller file (18mb), it was 75% faster for reading (1.6 versus 6.1 seconds) and 40% faster for writing (2.8 versus 4.7 seconds). It contained a large number of small objects.
method crc date time compressed uncompressed ratio
defla 92c9d529 May 19 16:56 6676006 17890857 62.7%
For the text reader/writer I used a 64mb csv text file. The gzip stream around the buffered stream was 11% faster for reading (950 versus 1070 milliseconds) and slightly faster when writing (7.9 versus 8.1 seconds).
method crc date time compressed uncompressed ratio
defla c6b72e34 May 20 09:16 22560860 63465800 64.5%

The buffering helps when the ultimate destination of the data is best read/written in larger chunks than your code would otherwise push it. So you generally want the buffering to be as close as possible to the place that wants larger chunks. In your examples, that's the elided "...", so wrap the BufferedOutputStream with the GzipOutputStream. And tune the BufferedOutputStream buffer size to match what testing shows works best with the destination.
I doubt the BufferedOutputStream on the outside would help much, if at all, over no explicit buffering. Why not? The GzipOutputStream will do its write()s to "..." in the same-sized chunks whether the outside buffering is present or not. So there's no optimizing for "..." possible; you're stuck with what sizes GzipOutputStream write()s.
Note also that you're using memory more efficiently by buffering the compressed data rather than the uncompressed data. If your data often achieves 6X compression, the 'inside' buffer is equivalent to an 'outside' buffer 6X as big.
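A sketch of that recommended ordering, with a hypothetical file path and an illustrative 64K buffer sitting next to the destination:

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

public class BufferNearDestination {
    // The buffer sits next to the file, so it batches the (already compressed)
    // bytes that GZIPOutputStream pushes toward disk.
    public static OutputStream open(String path) throws IOException {
        return new GZIPOutputStream(
                new BufferedOutputStream(new FileOutputStream(path), 64 * 1024));
    }
}
```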

Normally you want a buffer close to your FileOutputStream (assuming that's what ... represents) to avoid too many calls into the OS and frequent disk access. However, if you're writing a lot of small chunks to the GZIPOutputStream, you might benefit from a buffer around the GZIPOutputStream as well. The reason is that the write method in GZIPOutputStream is synchronized, and it also leads to a few other synchronized calls and a couple of native (JNI) calls (to update the CRC32 and do the actual compression). These all add extra overhead per call. So in that case I'd say you'd benefit from both buffers.
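Combining both suggestions, the doubly-buffered chain might look like this (the path and buffer sizes are assumptions):

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

public class DoubleBuffered {
    public static OutputStream open(String path) throws IOException {
        return new BufferedOutputStream(              // batches many small write() calls
                new GZIPOutputStream(                 // synchronized + JNI work per call
                        new BufferedOutputStream(     // batches compressed bytes to disk
                                new FileOutputStream(path), 64 * 1024)),
                8 * 1024);
    }
}
```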

I suggest you try a simple benchmark to time how long it takes to compress a large file, and see if it makes much difference. GzipOutputStream does have buffering, but it is a smaller buffer. I would do the first with a 64K buffer, but you might find that doing both is better.
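A rough sketch of such a benchmark (the chunk size, repetition count, and in-memory sink are assumptions; real timings will vary by machine):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

public class GzipBench {
    // Writes `reps` small chunks to the stream and returns elapsed milliseconds.
    static long time(OutputStream out, byte[] chunk, int reps) throws IOException {
        long start = System.nanoTime();
        try (OutputStream os = out) {
            for (int i = 0; i < reps; i++) os.write(chunk);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws IOException {
        byte[] chunk = new byte[100];
        long small = time(new GZIPOutputStream(new ByteArrayOutputStream()), chunk, 50_000);
        long big = time(new GZIPOutputStream(new ByteArrayOutputStream(), 64 * 1024), chunk, 50_000);
        System.out.println("512-byte buffer: " + small + " ms, 64K buffer: " + big + " ms");
    }
}
```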

Read the javadoc, and you will discover that BufferedInputStream is there to buffer bytes read from some original source. Once you have the raw bytes, you wrap the BufferedInputStream with a GZIPInputStream to decompress them. It makes no sense to buffer the output of the gzip stream itself; ask yourself what buffering the gzip layer would accomplish, and who would do that?
new GZIPInputStream(new BufferedInputStream(new FileInputStream(file)))


Okio vs java.io performance

I read the following blog:
https://medium.com/@jerzy.chalupski/a-closer-look-at-the-okio-library-90336e37261
It says:
"the Sinks and Sources are often connected into a pipe. Smart folks at Square realized that there's no need to copy the data between such pipe components like the java.io buffered streams do. All Sources and Sinks use Buffers under the hood, and Buffers keep the data in Segments, so quite often you can just take an entire Segment from one Buffer and move it to another."
I just don't understand where the copying of data happens in java.io, or in which case a Segment would be moved to another Buffer.
After reading the source code of Okio: if I write strings to a file with Okio like the following:
val sink = logFile.appendingSink().buffer()
sink.writeUtf8("xxxx")
there will be no "moving segment to another Buffer". Am I right?
Java's BufferedReader is just a Reader that buffers data into a buffer – the buffer being a char[], or something like that – so that every time you need a bunch of bytes/chars from it, it doesn't need to read bytes from a file/network/whatever your byte source is (as long as it has buffered enough bytes). A BufferedWriter does the opposite operation: whenever you write a bunch of bytes to the BufferedWriter, it doesn't actually write bytes to a file/socket/whatever, but it "parks" them into a buffer, so that it can flush the buffer only once it's full.
Overall, this minimises access to file/network/whatever, as it could be expensive.
When you pipe a BufferedReader to a BufferedWriter, you effectively have 2 buffers. How does Java move bytes from one buffer to the other? It copies them from the source to the sink using System.arraycopy (or something equivalent). Everything works well, except that copying a bunch of bytes requires an amount of time that grows linearly as the size of the buffer(s) grow. Hence, copying 1 MB will take roughly 1000 times more than copying 1 KB.
Okio, on the other hand, doesn't actually copy bytes. Oversimplifying the way it works, Okio has a single byte[] with the actual bytes, and the only thing that gets moved from the source to the sink is the pointer (or reference) to that byte[], which requires the same amount of time regardless of its size.
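A toy illustration of the difference (this is not Okio's actual API, just the idea of moving a segment reference instead of copying bytes):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class SegmentMoveDemo {
    // Moves every segment (byte[] reference) from source to sink: O(1) per segment,
    // regardless of how large each segment is.
    public static int moveAll(Deque<byte[]> source, Deque<byte[]> sink) {
        while (!source.isEmpty()) {
            sink.add(source.poll()); // no bytes are copied, only the reference moves
        }
        return sink.size();
    }

    public static void main(String[] args) {
        Deque<byte[]> source = new ArrayDeque<>();
        Deque<byte[]> sink = new ArrayDeque<>();
        source.add(new byte[8192]);

        // java.io style: copy every byte into the destination's own array, O(n).
        byte[] copy = new byte[8192];
        System.arraycopy(source.peek(), 0, copy, 0, 8192);

        // Okio style: just move the reference.
        System.out.println("sink segments: " + moveAll(source, sink));
    }
}
```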

java : Does using buffered input stream make the conversion of input streams to byte[] more efficient?

I want to convert an input stream to byte[] and I'm using IOUtils.toByteArray(inputStream). Will it be more efficient to use a wrapper like BufferedInputStream around the inputStream? Does it save memory?
Will it be more efficient to use a wrapper like BufferedInputStream around the inputStream?
Not by any significant amount. IOUtils.toByteArray reads data into a 4096-byte buffer. BufferedInputStream uses an 8192-byte buffer by default.
Using BufferedInputStream does fewer IO reads, but you need a very fast data source to notice any difference.
If you read an InputStream one byte at a time (or a few bytes at a time), then using a BufferedInputStream really improves performance, because it reduces the number of operating system calls by a factor of roughly 8000. And operating system calls take a lot of time, comparatively.
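One way to see this is to count how many calls actually reach the underlying stream; the CountingInputStream wrapper below is made up for the demo, and an in-memory source stands in for the OS:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadCountDemo {
    // Counts how many read calls actually reach the wrapped stream.
    static class CountingInputStream extends FilterInputStream {
        int calls = 0;
        CountingInputStream(InputStream in) { super(in); }
        @Override public int read() throws IOException { calls++; return super.read(); }
        @Override public int read(byte[] b, int off, int len) throws IOException {
            calls++; return super.read(b, off, len);
        }
    }

    // Returns {unbuffered call count, buffered call count} for byte-at-a-time reads.
    public static int[] counts() throws IOException {
        byte[] data = new byte[100_000];

        CountingInputStream raw = new CountingInputStream(new ByteArrayInputStream(data));
        while (raw.read() != -1) { }              // one underlying call per byte

        CountingInputStream inner = new CountingInputStream(new ByteArrayInputStream(data));
        InputStream buffered = new BufferedInputStream(inner); // default 8K buffer
        while (buffered.read() != -1) { }         // underlying calls happen per 8K block

        return new int[] { raw.calls, inner.calls };
    }

    public static void main(String[] args) throws IOException {
        int[] c = counts();
        System.out.println("unbuffered: " + c[0] + " calls, buffered: " + c[1] + " calls");
    }
}
```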
Does it save memory ?
No. IOUtils.toByteArray will create a new byte[4096] regardless of whether you pass in a buffered or an unbuffered InputStream. A BufferedInputStream costs a bit more memory to create, but nothing significant.
In terms of final memory consumption it wouldn't help: you will need to move the whole stream into a byte[] anyway, the size of that array would be the same, so memory consumption would be the same.
What BufferedOutputStream does is wrap another stream; instead of writing to it directly, it collects your output in an internal buffer and writes to the underlying stream only when it closes/flushes or when the internal buffer is full. It can make your write operations faster, as you do them in batches instead of writing directly each time, but it wouldn't help on the reading side here.

Java read huge file ( ~100GB ) efficiently

I would like to read a huge binary file (~100GB) efficiently in Java. I have to process each line of it; the line processing will be in separate threads. I don't want to load the whole file into memory. Does reading in chunks work? What will be the optimum buffer size? Any formula for that?
If this is a binary file, then reading in "lines" does not make a lot of sense.
If the file is really binary, then use a BufferedInputStream and read bytes one at a time into byte[]. When you get to the byte that marks your end of "line", add the byte[] and the count of bytes in the line to a queue for your worker threads to process.
And repeat.
Tips:
Use a bounded buffer in case you can read lines faster than you can process them.
Recycle the byte[] objects to reduce garbage generation.
If the file is (really) text, then you could use BufferedReader and the readLine() method instead of calling read().
The above will give you reasonable performance. Depending on how much work has to be done to process each line, it may be good enough that there is no point optimizing the file reading. You can check this by profiling.
If your profiling tells you that reading is the bottleneck, then consider using NIO with ByteBuffer or CharBuffer. It is more complicated but potentially faster than read() or readLine().
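A minimal sketch of the NIO variant (the 64K chunk size is an assumption, and the line-splitting logic is left out):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class NioChunkReader {
    // Reads the file in 64K chunks and returns the total byte count.
    public static long countBytes(Path file) throws IOException {
        long total = 0;
        ByteBuffer buf = ByteBuffer.allocateDirect(1 << 16); // 64K chunks
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            while (ch.read(buf) != -1) {
                buf.flip();
                total += buf.remaining(); // scan for line breaks here in real code
                buf.clear();
            }
        }
        return total;
    }
}
```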
Does reading in chunks work?
BufferedReader or BufferedInputStream both read in chunks, under the covers.
What will be the optimum buffer size?
The exact buffer size is probably not that important. I'd make it a few KB or tens of KB.
Any formula for that?
No there isn't a formula for an optimum buffer size. It will depend on variables that you can't quantify.
Java 8, streaming (note that the backslash in the path must be escaped, and the stream should be closed):
try (Stream<String> lines = Files.lines(Paths.get("c:\\myfile.txt"))) {
    lines.forEach(l -> {
        // Do anything line by line
    });
}

Buffer Size for BufferedInputStream

How do you determine the buffer size when using a BufferedInputStream to read a batch of files? Is it based on the file size? I'm using:
byte[] buf = new byte[4096];
If I increase the buffer size, will it read more quickly?
The default, which is deliberately undocumented, is 8192 bytes. Unless you have a compelling reason to change it, don't change it.
You can easily test it yourself, but it's not really a big issue. A few kilobytes is enough for the buffer, so you'll get good reading speeds.
If you profile your application and do realize that file IO is a performance bottleneck, there are ways to make it quicker, such as memory-mapping a file.
What you show there is the "byte size" that you are reading into (the array).
If you are reading from a FileInputStream (i.e. non buffered), then changing that size will change the read size, yes.
This is different from the internal buffer used by BufferedInputStream. It doesn't have a getter, but you can specify the size in the constructor and remember it from there, I suppose. The default is 8K, which may not be optimal.
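If the default does not suit your workload, the internal buffer size can be set via the second constructor argument (64K here is just an example):

```java
import java.io.BufferedInputStream;
import java.io.InputStream;

public class SizedBufferedInput {
    // Second constructor argument overrides the default 8K internal buffer.
    public static InputStream wrap(InputStream in) {
        return new BufferedInputStream(in, 64 * 1024);
    }
}
```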

DataOutputStream() VS DataOutputStream(new BufferedOutputStream())

The code at Java Tutorials showed an example of using DataOutputStream class and DataInputStream class.
A snippet of the code looks like this:
//..
out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(dataFile)));
//..
in = new DataInputStream(new BufferedInputStream(new FileInputStream(dataFile)));
//..
I was wondering why is it required to create a new BufferedOutputStream when we create a new DataOutputStream ?
Isn't it redundant since this alternative works as well? : new DataOutputStream(new FileOutputStream(dataFile));
As this page claims, DataStreams already provides a buffered file output byte stream. So is "double-buffering" really required?
I've modified the 2 lines of code (output and input), taking away the BufferedOutputStream and BufferedInputStream and everything seems to work just fine, so I was wondering what is the purpose of the BufferedOutputStream and BufferedInputStream ?
Wrapping the FileOutputStream in a BufferedOutputStream will generally speed up the overall output of your program. This will only be noticeable if you are writing large amounts of data. The same thing goes for wrapping an InputStream in a BufferedInputStream. The use of buffers will only affect efficiency, not correctness.
It's not redundant, it's just different. The Buffered variants add a buffering layer, speeding up IO operations by batching up reads and writes.
Instead of going to disk for every read/write, it goes to memory first. How much of a difference it makes depends on a variety of factors. The OS and/or disk I/O system also likely does some buffering.
I used to think that the Java IO model was unnecessarily large, but now that I really "get it" I find it quite elegant. A BufferedOutputStream is an implementation of the Decorator pattern (google it... it's useful). What this means is that BufferedOutputStream simply adds functionality to the outputstream it wraps. Internally, the BufferedOutputStream calls what ever OutputStream it decorates.
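To make the Decorator idea concrete, here is a toy decorator (not a JDK class) that counts bytes as they pass through to whatever stream it wraps:

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class CountingOutputStream extends FilterOutputStream {
    private long count = 0;

    public CountingOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(int b) throws IOException {
        count++;
        out.write(b); // delegate to the decorated stream
    }

    public long getCount() {
        return count;
    }
}
```

It composes like any other wrapper, e.g. `new CountingOutputStream(new BufferedOutputStream(fileOut))`, which is exactly how BufferedOutputStream itself adds buffering to whatever it decorates.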
Buffered IO streams help you read and write in bulk, thereby reducing the IO cost significantly. IO operations are fairly costly. Imagine your application doing a full read/write cycle for every byte that is read/written, as opposed to reading/writing a chunk of data in one go. A buffered read/write is definitely much more efficient. You will notice a huge difference in efficiency if you gather performance statistics in both cases, i.e. with and without buffered IO, especially when reading/writing a huge amount of data.
