How do I determine the buffer size when using a BufferedInputStream to read a batch of files? Is it based on the file size? I'm using:
byte[] buf = new byte[4096];
If I increase the buffer size, will it read more quickly?
The default, which is deliberately undocumented, is 8192 bytes. Unless you have a compelling reason to change it, don't change it.
You can easily test it yourself, but it's not really a big issue. A few kilobytes is enough for the buffer, so you'll get good reading speeds.
If you profile your application and find that file IO really is a performance bottleneck, there are ways to make it quicker, such as memory-mapping the file.
What you show there is the "byte size" that you are reading into (the array).
If you are reading from a FileInputStream (i.e. non buffered), then changing that size will change the read size, yes.
This is different from the internal buffer used by BufferedInputStream. That buffer doesn't have a getter, but you can specify its size in the constructor (and remember what you passed). The default is 8K, which may not be optimal.
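For example, a minimal sketch of setting an explicit internal buffer size; the 64K size and file name here are just placeholders to experiment with:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferSizeDemo {
    public static void main(String[] args) throws IOException {
        // Second constructor argument sets the internal buffer size (default is 8192).
        try (InputStream in = new BufferedInputStream(new FileInputStream("data.bin"), 64 * 1024)) {
            byte[] buf = new byte[4096]; // the array you read into, independent of the internal buffer
            int n;
            while ((n = in.read(buf)) != -1) {
                // process buf[0..n)
            }
        }
    }
}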
Related
I want to convert an input stream to a byte[] and I'm using IOUtils.toByteArray(inputStream). Will it be more efficient if I wrap the inputStream in a BufferedInputStream? Does it save memory?
Will it be more efficient if I wrap the inputStream in a BufferedInputStream?
Not to any significant degree. IOUtils.toByteArray reads data into a 4096-byte buffer; BufferedInputStream uses an 8192-byte buffer by default.
Using BufferedInputStream does fewer IO reads, but you need a very fast data source to notice any difference.
If you read an InputStream one byte at a time (or a few bytes at a time), then using a BufferedInputStream really does improve performance, because it reduces the number of operating system calls by a factor of roughly 8000. And operating system calls are comparatively expensive.
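A quick sketch of the single-byte case where the buffering pays off; the file name is just a placeholder:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class SingleByteReadDemo {
    public static void main(String[] args) throws IOException {
        long count = 0;
        // Without the BufferedInputStream wrapper, every read() below would be a system call.
        try (InputStream in = new BufferedInputStream(new FileInputStream("data.bin"))) {
            int b;
            while ((b = in.read()) != -1) { // served from the internal buffer most of the time
                count++;
            }
        }
        System.out.println(count + " bytes read");
    }
}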
Does it save memory?
No. IOUtils.toByteArray will create a new byte[4096] regardless of whether you pass in a buffered or an unbuffered InputStream. A BufferedInputStream costs a bit more memory to create, but nothing significant.
In terms of final memory consumption it wouldn't help: you still have to copy the whole stream into a byte[], the size of that array would be the same, so memory consumption would be the same.
What BufferedInputStream does is wrap another stream and, instead of hitting the underlying stream on every call, pull data from it in large chunks into an internal buffer and serve your reads from that buffer. It can make your read operations faster, since the underlying stream is accessed in batches rather than on every call, but it doesn't reduce the amount of memory the final byte[] needs.
I would like to read a huge binary file (~100 GB) efficiently in Java. I have to process each line of it, and the line processing will happen in separate threads. I don't want to load the whole file into memory. Does reading in chunks work? What will be the optimum buffer size? Any formula for that?
If this is a binary file, then reading in "lines" does not make a lot of sense.
If the file is really binary, then use a BufferedInputStream and read bytes one at a time into a byte[]. When you get to the byte that marks your end of "line", add the byte[] and the count of bytes in the line to a queue for your worker threads to process.
And repeat.
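A minimal, untested sketch of that loop, assuming '\n' marks the end of a "line" and using an ArrayBlockingQueue as the bounded hand-off to the workers (file name, queue capacity, and initial line size are placeholders):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BinaryLineReader {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Bounded queue so the reader blocks if the workers fall behind.
        // (Worker threads that drain the queue are omitted here.)
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(1024);
        try (InputStream in = new BufferedInputStream(new FileInputStream("huge.bin"))) {
            byte[] line = new byte[8192];
            int count = 0;
            int b;
            while ((b = in.read()) != -1) {
                if (b == '\n') {                           // assumed end-of-"line" marker
                    queue.put(Arrays.copyOf(line, count)); // hand the line to a worker thread
                    count = 0;
                } else {
                    if (count == line.length) {
                        line = Arrays.copyOf(line, line.length * 2); // grow if a line is longer than expected
                    }
                    line[count++] = (byte) b;
                }
            }
            if (count > 0) {
                queue.put(Arrays.copyOf(line, count)); // last line without a trailing marker
            }
        }
    }
}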
Tips:
Use a bounded buffer in case you can read lines faster than you can process them.
Recycle the byte[] objects to reduce garbage generation.
If the file is (really) text, then you could use BufferedReader and the readLine() method instead of calling read().
The above will give you reasonable performance. Depending on how much work has to be done to process each line, it may be good enough that there is no point optimizing the file reading. You can check this by profiling.
If your profiling tells you that reading is the bottleneck, then consider using NIO with ByteBuffer or CharBuffer. It is more complicated, but potentially faster than read() or readLine().
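If profiling does point you at NIO, here is a bare-bones sketch of reading fixed-size chunks with a FileChannel and a direct ByteBuffer; the path, buffer size, and the line-scanning step are placeholders:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class NioChunkReadDemo {
    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get("huge.bin"), StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024); // direct buffer, filled chunk by chunk
            while (ch.read(buf) != -1) {
                buf.flip();
                // scan buf for end-of-line markers and hand complete lines to the workers
                buf.clear();
            }
        }
    }
}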
Does reading in chunks work?
BufferedReader or BufferedInputStream both read in chunks, under the covers.
What will be the optimum buffer size?
The exact buffer size probably isn't that important. I'd make it a few KB, or tens of KB.
Any formula for that?
No, there isn't a formula for an optimum buffer size. It will depend on variables that you can't quantify.
Java 8, streaming
try (Stream<String> lines = Files.lines(Paths.get("c:\\myfile.txt"))) {
    lines.forEach(l -> {
        // Do anything line by line
    });
}
When calling read(byte[]) on a FileInputStream, the read size is always 8K, even if the byte[] is much larger.
How do you increase the max read amount returned per call?
Please do not suggest a method that merely masks the limitation of FileInputStream.
Update: There doesn't seem to be a real solution to this. However, I measured the method-call overhead at about 226 µs on my system for a 1 GB file. It's probably safe to say this is not going to impact performance in any real way.
Wrap it in a BufferedInputStream which allows you to specify the buffer size.
You could try to memory map the file by using NIO, but I'm not sure what the problem with 8K is.
You can either copy the 8K into your bigger array, or use the returned length to call
public int read(byte[] b, int off, int len) throws IOException
with off advanced by the number of bytes the previous read returned, so that each call appends to the same array.
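A minimal sketch of that accumulation loop; the file name and array size are placeholders:

import java.io.FileInputStream;
import java.io.IOException;

public class FillArrayDemo {
    public static void main(String[] args) throws IOException {
        byte[] big = new byte[1 << 20]; // 1 MB target array
        try (FileInputStream in = new FileInputStream("data.bin")) {
            int off = 0;
            int n;
            // Each read appends at the current offset; the OS may return fewer bytes than requested.
            while (off < big.length && (n = in.read(big, off, big.length - off)) != -1) {
                off += n;
            }
            System.out.println("read " + off + " bytes");
        }
    }
}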
The size of each read you see might be the buffer size used by the operating system itself, so you might have to make a change at the OS level. How you do that is system dependent. You might be able to specify the block size when creating the file system; this has traditionally been possible for Unix filesystems, although ironically I believe the feature was mostly used to get smaller blocks for filesystems expected to hold many small files.
I need to detect whether the file I am attaching to an email is exceeding the server limit. I am not allowed to increase the JVM heap size to do this since it is going to affect the application performance.
If I don’t increase the JVM heap size, I will run into OutOfMemoryError directly.
I would like to know how to allocate memory from the OS instead of increasing the JVM's heap size.
Thanks a lot!
Are you really trying to read the entire file to determine its size to check if it is less than some configured value (your question is not too easy to understand)? If so, why are you not using File#length() instead?
If you need to stream the file to the server in order to find out whether it's too big, you still don't need to read the whole file into memory.
Instead, read maybe 10-100k into memory. Fill the buffer, send it to the server. Repeat until the file is done or the server complains. Then you don't need enough memory for the whole file.
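A rough sketch of that approach, with a running byte count and an assumed server limit; the limit, chunk size, and the output stream wiring are placeholders:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class ChunkedSendDemo {
    static void send(String path, OutputStream serverOut, long maxBytes) throws IOException {
        byte[] buf = new byte[64 * 1024]; // 64K chunk, far below any heap limit
        long sent = 0;
        try (InputStream in = new FileInputStream(path)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                sent += n;
                if (sent > maxBytes) {
                    throw new IOException("Attachment exceeds server limit of " + maxBytes + " bytes");
                }
                serverOut.write(buf, 0, n);
            }
        }
    }
}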
If you write your own stream handling code, you could create your own counter to track the number of bytes transmitted. I'd be surprised if there wasn't already some sort of Filter class that does this for you. Sun has a page about this. Search for 'CountReader'.
You could allocate the memory natively via native code and JNI. However, that sounds like a painful way to do this.
Instead can't you give the JVM suitable memory configurations (via -Xmx) ? If the document you're mailing is so big that you can't easily handle it, then I'm not sure email is the correct medium to transfer it, and you should instead host it and send a link to it, or perhaps FTP it.
If all the other solutions turn out to be unusable (and I would encourage you to find a better way than requiring the entire file to fit in memory!) you could consider using a direct ByteBuffer. It has the option of using mmap() or other system calls to map a file into your memory without actually reading / allocating space in the heap. You can do this by calling map() on a FileChannel -- API documentation. Note that this is potentially expensive and/or not supported on some platforms, so it should be considered suboptimal compared to any solution which does not require the entire file to be in memory.
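A small sketch of mapping a file with FileChannel.map; note that a single mapping is limited to 2 GB, and the path here is a placeholder:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MmapDemo {
    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get("big.dat"), StandardOpenOption.READ)) {
            // The mapping lives outside the Java heap; a single map() call is limited to 2 GB.
            long len = Math.min(ch.size(), Integer.MAX_VALUE);
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, len);
            System.out.println("mapped " + buf.remaining() + " bytes");
        }
    }
}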
Socket s = /* go get your socket to the server */;
InputStream is = new FileInputStream("foo.txt");
OutputStream os = s.getOutputStream();
byte[] buf = new byte[4096];
int len;
while ((len = is.read(buf)) != -1) {
    os.write(buf, 0, len);
}
os.close();
is.close();
Of course handle your Exceptions.
If you're not allowed to increase the heap size because of memory constaints, doing an "under the table" memory allocation would cause the same problems. It sounds like you're looking for a loophole in the rules. Like, "My doctor says to cut down on how much I eat at each meal, so I'm eating more between meals to make up for it."
The only way I know of to allocate memory without using the Java heap would be to write JNI calls that malloc the memory in C. But then how would you use this memory? You'd have to write more JNI calls to interact with it. I think you'd end up basically re-inventing Java.
If the goal here is to send a large file, just use buffered streams and read/write it one byte at a time. A buffered stream, as the name implies, takes care of the buffering for you, so you're not really hitting the hard drive one byte at a time. It actually reads 8K at a time (I think that's the default) and then hands those bytes to you as you ask for them. Likewise, on the write side it saves up a few KB and sends them in chunks.
So all you should have to do is open a BufferedInputStream and a BufferedOutputStream. Then write a loop that reads one byte from the input stream and writes it to the output stream until you hit end-of-file.
Something like:
OutputStream os = ...; // however you're getting your socket
BufferedInputStream bis = new BufferedInputStream(new FileInputStream(fileObject));
BufferedOutputStream bos = new BufferedOutputStream(os);
int b;
while ((b = bis.read()) != -1) {
    bos.write(b);
}
bis.close();
bos.close();
No need to make life complicated for yourself by re-inventing buffering.
Can anyone recommend whether I should do something like:
os = new GZIPOutputStream(new BufferedOutputStream(...));
or
os = new BufferedOutputStream(new GZIPOutputStream(...));
Which is more efficient? Should I use BufferedOutputStream at all?
GZIPOutputStream already comes with a built-in buffer. So, there is no need to put a BufferedOutputStream right next to it in the chain. gojomo's excellent answer already provides some guidance on where to place the buffer.
The default buffer size for GZIPOutputStream is only 512 bytes, so you will want to increase it to 8K or even 64K via the constructor parameter. The default buffer size for BufferedOutputStream is 8K, which is why you can measure an advantage when combining the default GZIPOutputStream and BufferedOutputStream. That advantage can also be achieved by properly sizing the GZIPOutputStream's built-in buffer.
So, to answer your question: "Should I use BufferedOutputStream at all?" → No, in your case, you should not use it, but instead set the GZIPOutputStream's buffer to at least 8K.
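For instance, a sketch of sizing the built-in buffer instead of adding a BufferedOutputStream; the 64K size and file name are placeholders:

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class GzipBufferDemo {
    public static void main(String[] args) throws IOException {
        // Second constructor argument sets GZIPOutputStream's internal buffer (default is only 512 bytes).
        try (OutputStream os = new GZIPOutputStream(new FileOutputStream("out.gz"), 64 * 1024)) {
            os.write("hello, gzip".getBytes(StandardCharsets.UTF_8));
        }
    }
}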
In what order should I use GZIPOutputStream and BufferedOutputStream?
For object streams, I found that wrapping the buffered stream around the gzip stream, for both input and output, was almost always significantly faster. The smaller the objects, the more this helped. It was better than or equal to having no buffered stream in all cases.
ois = new ObjectInputStream(new BufferedInputStream(new GZIPInputStream(fis)));
oos = new ObjectOutputStream(new BufferedOutputStream(new GZIPOutputStream(fos)));
However, for text and straight byte streams, I found that it was a toss-up, with the gzip stream around the buffered stream being only slightly better. But it was better than no buffered stream in all cases.
reader = new InputStreamReader(new GZIPInputStream(new BufferedInputStream(fis)));
writer = new OutputStreamWriter(new GZIPOutputStream(new BufferedOutputStream(fos)));
I ran each version 20 times, discarded the first run, and averaged the rest. I also tried buffered-gzip-buffered, which was slightly better for objects and worse for text. I did not play with buffer sizes at all.
For the object streams, I tested 2 serialized object files in the 10s of megabytes. For the larger file (38mb), it was 85% faster on reading (0.7 versus 5.6 seconds) but actually slightly slower for writing (5.9 versus 5.7 seconds). These objects had some large arrays in them which may have meant larger writes.
method crc date time compressed uncompressed ratio
defla eb338650 May 19 16:59 14027543 38366001 63.4%
For the smaller file (18mb), it was 75% faster for reading (1.6 versus 6.1 seconds) and 40% faster for writing (2.8 versus 4.7 seconds). It contained a large number of small objects.
method crc date time compressed uncompressed ratio
defla 92c9d529 May 19 16:56 6676006 17890857 62.7%
For the text reader/writer I used a 64mb csv text file. The gzip stream around the buffered stream was 11% faster for reading (950 versus 1070 milliseconds) and slightly faster when writing (7.9 versus 8.1 seconds).
method crc date time compressed uncompressed ratio
defla c6b72e34 May 20 09:16 22560860 63465800 64.5%
Buffering helps when the ultimate destination of the data is best read/written in larger chunks than your code would otherwise push to it. So you generally want the buffering as close as possible to the place that wants larger chunks. In your examples, that's the elided "...", so wrap the BufferedOutputStream with the GZIPOutputStream. And tune the BufferedOutputStream buffer size to match what testing shows works best with the destination.
I doubt the BufferedOutputStream on the outside would help much, if at all, over no explicit buffering. Why not? The GZIPOutputStream will do its write()s to "..." in the same-sized chunks whether the outside buffering is present or not. So there's no optimizing for "..." possible; you're stuck with whatever sizes GZIPOutputStream writes.
Note also that you're using memory more efficiently by buffering the compressed data rather than the uncompressed data. If your data often achieves 6X compression, the 'inside' buffer is equivalent to an 'outside' buffer 6X as big.
Normally you want a buffer close to your FileOutputStream (assuming that's what ... represents) to avoid too many calls into the OS and frequent disk access. However, if you're writing a lot of small chunks to the GZIPOutputStream, you might benefit from a buffer around the GZIPOutputStream as well. The reason is that its write method is synchronized and also leads to a few other synchronized calls and a couple of native (JNI) calls (to update the CRC32 and do the actual compression). These all add extra overhead per call. So in that case I'd say you'd benefit from both buffers.
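A sketch of that double-buffered arrangement; the buffer sizes, file name, and the tiny-write loop are just placeholders:

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

public class DoubleBufferGzipDemo {
    public static void main(String[] args) throws IOException {
        try (OutputStream os =
                 new BufferedOutputStream(                 // batches many tiny writes before they hit the gzip layer
                     new GZIPOutputStream(
                         new BufferedOutputStream(         // batches compressed output before it hits the disk
                             new FileOutputStream("out.gz"), 64 * 1024),
                         8 * 1024))) {
            for (int i = 0; i < 100_000; i++) {
                os.write(i & 0xFF); // lots of small writes, the case where the outer buffer pays off
            }
        }
    }
}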
I suggest you try a simple benchmark to time how long it takes to compress a large file and see if it makes much difference. GZIPOutputStream does have buffering, but it is a smaller buffer. I would do the first with a 64K buffer, but you might find that doing both is better.
Read the javadoc and you will discover that BufferedInputStream is there to buffer bytes read from the original source. Once you have those raw bytes, you want to decompress them, so you wrap the BufferedInputStream with a GZIPInputStream. Buffering the output of the GZIP stream itself makes little sense; ask yourself who would actually benefit from that buffering.
new GZIPInputStream(new BufferedInputStream(new FileInputStream(...)))