When calling read(byte[]) on a FileInputStream, each call returns at most 8 KB, even if the byte[] is far larger.
How do you increase the max read amount returned per call?
Please do not suggest a method that merely masks the limitation of FileInputStream.
Update: There doesn't seem to be a real solution to this. However, I measured the method call overhead at about 226 µs on my system for a 1 GB file, so it's probably safe to say this is not going to impact performance in any real way.
Wrap it in a BufferedInputStream which allows you to specify the buffer size.
You could try to memory map the file by using NIO, but I'm not sure what the problem with 8K is.
You can either copy the 8K into your bigger array, or use the returned length to call
public int read(byte[] b, int off, int len) throws IOException
with off set to the total number of bytes read so far (after the first read, that is simply its return value).
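The second option could look roughly like the sketch below. The names ReadFully and readInto are made up for illustration, and a ByteArrayInputStream stands in for the FileInputStream so the example runs without a file on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class ReadFully {
    /**
     * Fills b as far as possible by calling read(b, off, len) repeatedly.
     * Returns the number of bytes stored in b; this is less than b.length
     * only if the stream ended first.
     */
    static int readInto(InputStream in, byte[] b) throws IOException {
        int off = 0; // running total of bytes read so far
        while (off < b.length) {
            int n = in.read(b, off, b.length - off);
            if (n == -1) {
                break; // end of stream before the array was full
            }
            off += n;
        }
        return off;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in source; in practice this would be a FileInputStream.
        InputStream in = new ByteArrayInputStream(new byte[100_000]);
        byte[] big = new byte[64 * 1024];
        System.out.println(readInto(in, big) + " bytes read");
    }
}
```

On Java 9 and later, InputStream.readNBytes(b, off, len) performs the same loop for you.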
The size of each read you see might be the buffer size used by the operating system itself. So you might have to make a change at the OS level. How you do that would be system dependent. You might be able to specify the block size when creating the file system. This has traditionally been possible for Unix filesystems. Although ironically I believe the feature was used to have smaller blocks for filesystems expected to have many small files.
Related
How do I determine the buffer size when using a BufferedInputStream to read a batch of files? Is it based on the file size? I'm using
byte[] buf = new byte[4096];
If I increase the buffer size, will it read faster?
The default, which is deliberately undocumented, is 8192 bytes. Unless you have a compelling reason to change it, don't change it.
You can easily test it yourself, but it's not really a big issue. A few kilobytes is enough for the buffer, so you'll get good reading speeds.
If you profile your application and find that file I/O really is a performance bottleneck, there are ways to make it quicker, such as memory-mapping the file.
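For reference, memory-mapping with NIO might look like the sketch below. MappedSum and sumBytes are hypothetical names, and summing the bytes is only a placeholder for whatever processing you actually do.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedSum {
    /** Sums the unsigned value of every byte in the file via a read-only mapping. */
    static long sumBytes(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // A single mapping is limited to Integer.MAX_VALUE bytes;
            // a larger file would need several mappings.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            long sum = 0;
            while (buf.hasRemaining()) {
                sum += buf.get() & 0xFF;
            }
            return sum;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("mapped", ".bin");
        tmp.toFile().deleteOnExit();
        Files.write(tmp, new byte[] {1, 2, 3, 4});
        System.out.println(sumBytes(tmp));
    }
}
```

Whether this beats a plain buffered read depends on the file size and access pattern, so measure before committing to it.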
What you show there is the "byte size" that you are reading into (the array).
If you are reading from a FileInputStream (i.e. non buffered), then changing that size will change the read size, yes.
This is different than the internal buffer used by BufferedInputStream. It doesn't have a getter but you can specify the size in the constructor and "remember it" from that I suppose. Default is 8K which may not be optimal.
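To make the distinction concrete, here is a small sketch (BufferedCount and countBytes are made-up names). The bufferSize argument sets the internal buffer of the BufferedInputStream, which is unrelated to any byte[] you might pass to read.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class BufferedCount {
    /** Counts the bytes of a stream using single-byte reads on a buffered wrapper. */
    static long countBytes(InputStream raw, int bufferSize) throws IOException {
        // bufferSize is the size of the wrapper's internal buffer: the underlying
        // stream is read in chunks of up to that size, not one byte at a time.
        try (InputStream in = new BufferedInputStream(raw, bufferSize)) {
            long count = 0;
            while (in.read() != -1) {
                count++;
            }
            return count;
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in source; in practice this would be a FileInputStream.
        InputStream raw = new ByteArrayInputStream(new byte[20_000]);
        System.out.println(countBytes(raw, 64 * 1024));
    }
}
```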
This does not look trivial, especially for a read/write buffered FileChannel. Is there an open-source implementation somewhere that I can base mine on?
To be clear for those who did not understand:
FileChannel does buffering at the OS level and I want to do buffering at the Java level. Read here to understand: FileChannel#force and buffering
@Peter: I want to write a huge file to disk from a fast message stream. Buffering and batching are the way to go. So I want to batch in Java and then call FileChannel.write.
I recommend using a BufferedOutputStream wrapping a FileOutputStream. I do not believe you will see any performance improvement by mucking with ByteBuffer and FileChannel, and that you'll be left with a lot of hard-to-maintain code if you go that route.
The reasoning is quite simple: regardless of the approach you take, the steps involved are the same:
1. Generate bytes. You don't say how you plan to do this, and it could introduce an additional level of temporary buffering into the equation. But regardless, the Java data has to be turned into bytes.
2. Accumulate bytes into a buffer. You want to buffer your data before writing it, so that you're not making lots of small writes. That's a given. But where that buffer lives is immaterial.
3. Move bytes from the Java heap to the C heap, across the JNI barrier. Writing a file is a native operation, and it doesn't read directly from the Java heap. So whether you buffer on the Java heap and then move the buffered bytes, or buffer in a direct ByteBuffer (and yes, you want a direct buffer), you're still moving the bytes. You will make more JNI calls with the ByteBuffer, but that's a marginal cost.
4. Invoke the native write call (write on POSIX systems), which copies bytes from the C heap into a kernel-maintained disk buffer.
5. Write the kernel buffer to disk. This will outweigh all the other steps combined, because disks are slow.
There may be a few microseconds gained or lost depending on exactly how you implement these steps, but you can't change the basic steps.
The FileChannel does give you the option to call force(), to ensure that step #5 actually happens. This is likely to actually decrease your overall performance, as the underlying fsync call will not return until the bytes are written. And if you really want to do it, you can always get the channel from the underlying stream.
Bottom line: I'm willing to bet that you're actually IO-bound, and there's no cure for that save better hardware.
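As a rough sketch of that recommendation, assuming the messages arrive as strings (the names BufferedWriteWithForce and writeMessages, and the newline framing, are my own choices for illustration):

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

final class BufferedWriteWithForce {
    /** Writes each message followed by a newline, batching small writes in a Java-side buffer. */
    static void writeMessages(Path target, Iterable<String> messages, int bufferSize)
            throws IOException {
        try (FileOutputStream fos = new FileOutputStream(target.toFile());
             BufferedOutputStream out = new BufferedOutputStream(fos, bufferSize)) {
            for (String m : messages) {
                out.write(m.getBytes(StandardCharsets.UTF_8));
                out.write('\n');
            }
            out.flush();                  // hand the buffered bytes to the OS
            // Optional and potentially slow: block until the OS has written
            // the data (and metadata) to the storage device.
            fos.getChannel().force(true);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("messages", ".log");
        tmp.toFile().deleteOnExit();
        writeMessages(tmp, Arrays.asList("first", "second"), 64 * 1024);
        System.out.println(Files.size(tmp) + " bytes written");
    }
}
```

Dropping the force(true) line leaves the flush to disk up to the OS, which is usually what you want for throughput.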
FileChannel only works with ByteBuffers, so it is naturally buffered. If you need additional buffering you can copy data from ByteBuffer to ByteBuffer, but I am not sure why you would want to.
FileChannel does buffering at the OS level
FileChannel does tell the OS what to do. The OS usually has a disk cache but FileChannel has no idea whether this is the case or not.
I want to do buffering at the Java level
You are in luck, because you don't have a choice. ;) This is the only option.
I would have two threads: the producer thread produces ByteBuffers and appends them to the tail of a queue, and the consumer thread removes some ByteBuffers from the head of the queue each time and calls fileChannel.write(ByteBuffer[]).
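A minimal sketch of that two-thread design follows. The class name QueuedWriter, the POISON end-of-stream marker, the batch size of 64 and the queue capacity are all assumptions of mine, not part of the suggestion above.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class QueuedWriter {
    // Hypothetical end-of-stream marker: compared by identity, never written.
    static final ByteBuffer POISON = ByteBuffer.allocate(0);

    /** Consumer side: drains the queue in batches, one gathering write per batch. */
    static void consume(BlockingQueue<ByteBuffer> queue, FileChannel ch)
            throws IOException, InterruptedException {
        List<ByteBuffer> drained = new ArrayList<>();
        boolean done = false;
        while (!done) {
            drained.clear();
            drained.add(queue.take());   // block until at least one buffer arrives
            queue.drainTo(drained, 63);  // then take whatever else is already queued
            List<ByteBuffer> batch = new ArrayList<>();
            for (ByteBuffer b : drained) {
                if (b == POISON) {       // identity check: ByteBuffer.equals compares content
                    done = true;
                    break;
                }
                batch.add(b);
            }
            ByteBuffer[] bufs = batch.toArray(new ByteBuffer[0]);
            // A gathering write may be partial, so repeat until all buffers are empty.
            while (anyRemaining(bufs)) {
                ch.write(bufs);
            }
        }
    }

    private static boolean anyRemaining(ByteBuffer[] bufs) {
        for (ByteBuffer b : bufs) {
            if (b.hasRemaining()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("queued", ".log");
        tmp.toFile().deleteOnExit();
        BlockingQueue<ByteBuffer> queue = new LinkedBlockingQueue<>(1024);
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 1000; i++) {
                    byte[] msg = ("message " + i + "\n").getBytes(StandardCharsets.UTF_8);
                    queue.put(ByteBuffer.wrap(msg));
                }
                queue.put(POISON);       // tell the consumer to stop
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
            consume(queue, ch);
        }
        producer.join();
        System.out.println(Files.size(tmp) + " bytes written");
    }
}
```

The bounded queue gives you back-pressure: if the disk falls behind, the producer blocks in put() instead of filling the heap.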
This is more like a matter of conscience than a technological issue :p
I'm writing some Java code to download files from a server. For that, I'm using the BufferedOutputStream method write() and the BufferedInputStream method read().
So my question is: if I use a buffer to hold the bytes, how many bytes should I read at a time? Sure, I can read byte by byte using just int b = read() and then write(b), or I could use a buffer. If I take the second approach, are there any aspects I must pay attention to when choosing the number of bytes to read/write each time? What will this number affect in my program?
Thanks
Unless you have a really fast network connection, the size of the buffer will make little difference. I'd say that 4k buffers would be fine, though there's no harm in using buffers a bit bigger.
The same probably applies to using read() versus read(byte[]) ... assuming that you are using a BufferedInputStream.
Unless you have an extraordinarily fast / low-latency network connection, the bottleneck is going to be the data rate that the network and your computers' network interfaces can sustain. For a typical internet connection, the application can move the data two or more orders of magnitude faster than the network can. So unless you do something silly (like doing 1-byte reads on an unbuffered stream), your Java code won't be the bottleneck.
BufferedInputStream and BufferedOutputStream typically rely on System.arraycopy for their implementations. System.arraycopy has a native implementation, which likely relies on memmove or bcopy. The amount of memory that is copied will depend on the available space in your buffer, but regardless, the implementation down to the native code is pretty efficient, unlikely to affect the performance of your application regardless of how many bytes you are reading/writing.
However, with respect to BufferedInputStream, if you set a mark with a high limit, a new internal buffer may need to be created. If you do use a mark, reading more bytes than are available in the old buffer may cause a temporary performance hit, though the amortized performance is still linear.
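For anyone unfamiliar with mark/reset, here is a small sketch of the pattern being described. MarkDemo and peek are hypothetical names; the point is only that a read limit larger than the internal buffer still works, at the cost of the buffer being grown.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Arrays;

final class MarkDemo {
    /** Reads up to count bytes ahead of the current position, then rewinds to it. */
    static byte[] peek(BufferedInputStream in, int count) throws IOException {
        in.mark(count);                  // promise: at most count bytes are read before reset()
        byte[] ahead = new byte[count];
        int off = 0;
        while (off < count) {
            int n = in.read(ahead, off, count - off);
            if (n == -1) {
                break;                   // stream ended early
            }
            off += n;
        }
        in.reset();                      // rewind to the marked position
        return Arrays.copyOf(ahead, off);
    }

    public static void main(String[] args) throws IOException {
        // Deliberately tiny internal buffer (8 bytes) with a larger read limit (50).
        BufferedInputStream in =
                new BufferedInputStream(new ByteArrayInputStream(new byte[100]), 8);
        System.out.println(peek(in, 50).length + " bytes peeked");
    }
}
```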
As Stephen C mentioned, you are more likely to see performance issues due to the network.
What is the MTU (maximum transmission unit) of your network connection? If you are using UDP, for example, you can check this value and use a byte array no larger than it. If that doesn't matter, check how much memory your program uses. I think 1024 to 4096 bytes is a good size for holding the data while you continue to receive.
If you pump data you normally do not need to use any Buffered streams. Just make sure you use a decently sized (8-64k) temporary byte[] buffer passed to the read method (or use a pump method which does it). The default buffer size is too small for most usages (and if you use a larger temp array it will be ignored anyway)
When I am using FileInputStream to read an object (say a few bytes), does the underlying operation involve:
1) Reading a whole block of disk so that if I subsequently do another read operation, it wouldn't require a real disk read as that portion of the file was already fetched in the last read operation?
OR
2) A new disk access to take place because FileInputStream does not do any buffering and bufferedInputStream should have been used instead to achieve the effect of (1)?
I think that since FileInputStream uses the read system call and it reads only a set of pages from the hard disk, some buffering must take place.
FileInputStream will make an underlying native system call. Most OSes will do their own buffering for this, so it does not need a real disk seek for each byte. But you still have the cost of making the native OS call, which is expensive, so a BufferedInputStream would be preferable. However, for reading small amounts of data (like you say, a few bytes or even kBs), either one should be fine as the number of OS calls won't be that different.
Native code for FileInputStream is here: it doesn't look like there is any buffering going on in there. The OS buffering may kick in, but there's no explicit indicator one way or another if/when that happens.
One thing to look out for is reading from a mounted network volume over a slow connection. I ran into a big performance issue using a non-buffered FileInputStream for this. Didn't catch it in development, because the file system was local.
I need to detect whether the file I am attaching to an email is exceeding the server limit. I am not allowed to increase the JVM heap size to do this since it is going to affect the application performance.
If I don’t increase the JVM heap size, I will run into OutOfMemoryError directly.
I would like to know how to allocate the memory from the OS instead of increasing the JVM’s heap size.
Thanks a lot!
Are you really trying to read the entire file to determine its size to check if it is less than some configured value (your question is not too easy to understand)? If so, why are you not using File#length() instead?
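If a size check is all that's needed, something like the sketch below would do. AttachmentCheck, fitsLimit and the 10 MB figure are placeholders; the real limit would come from the mail server's configuration.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

final class AttachmentCheck {
    // Hypothetical limit, for illustration only.
    static final long MAX_ATTACHMENT_BYTES = 10L * 1024 * 1024;

    /** Returns true if f is a regular file no larger than maxBytes. */
    static boolean fitsLimit(File f, long maxBytes) {
        // length() asks the file system for the size; no file content is read,
        // so the heap is not involved at all.
        return f.isFile() && f.length() <= maxBytes;
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("attachment", ".bin");
        tmp.deleteOnExit();
        Files.write(tmp.toPath(), new byte[1024]);
        System.out.println(fitsLimit(tmp, MAX_ATTACHMENT_BYTES));
    }
}
```

Keep in mind that mail encodings such as Base64 inflate the transmitted size by roughly a third, so the on-disk size is only a lower bound on what the server sees.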
If you need to stream the file to the server in order to find out whether it's too big, you still don't need to read the whole file into memory.
Instead, read maybe 10-100k into memory. Fill the buffer, send it to the server. Repeat until the file is done or the server complains. Then you don't need enough memory for the whole file.
If you write your own stream handling code, you could create your own counter to track the number of bytes transmitted. I'd be surprised if there wasn't already some sort of Filter class that does this for you. Sun has a page about this. Search for 'CountReader'.
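Rolling your own counter is only a few lines. The sketch below is one way to do it, with made-up names (LimitedCopy, copyWithLimit); it stops as soon as the running total passes the limit, so at most one buffer's worth of data is ever held in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class LimitedCopy {
    /**
     * Copies in to out in 64 KB chunks and returns the number of bytes copied.
     * Throws IOException as soon as more than maxBytes have been read.
     */
    static long copyWithLimit(InputStream in, OutputStream out, long maxBytes)
            throws IOException {
        byte[] buf = new byte[64 * 1024];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
            if (total > maxBytes) {
                throw new IOException("stream exceeds limit of " + maxBytes + " bytes");
            }
            out.write(buf, 0, n);
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-ins for the file and the connection to the server.
        InputStream in = new ByteArrayInputStream(new byte[200_000]);
        OutputStream out = new ByteArrayOutputStream();
        System.out.println(copyWithLimit(in, out, 1_000_000) + " bytes copied");
    }
}
```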
You could allocate the memory natively via native code and JNI. However that sounds a painful way to do this.
Instead can't you give the JVM suitable memory configurations (via -Xmx) ? If the document you're mailing is so big that you can't easily handle it, then I'm not sure email is the correct medium to transfer it, and you should instead host it and send a link to it, or perhaps FTP it.
If all the other solutions turn out to be unusable (and I would encourage you to find a better way than requiring the entire file to fit in memory!) you could consider using a direct ByteBuffer. It has the option of using mmap() or other system calls to map a file into your memory without actually reading / allocating space in the heap. You can do this by calling map() on a FileChannel -- API documentation. Note that this is potentially expensive and/or not supported on some platforms, so it should be considered suboptimal compared to any solution which does not require the entire file to be in memory.
Socket s = /* go get your socket to the server */;
// try-with-resources closes both streams even if the copy fails
try (InputStream is = new FileInputStream("foo.txt");
     OutputStream os = s.getOutputStream()) {
    byte[] buf = new byte[4096];
    int len;
    // Only one 4 KB chunk of the file is in memory at any time.
    while ((len = is.read(buf)) != -1) {
        os.write(buf, 0, len);
    }
}
Of course handle your Exceptions.
If you're not allowed to increase the heap size because of memory constraints, doing an "under the table" memory allocation would cause the same problems. It sounds like you're looking for a loophole in the rules. Like, "My doctor says to cut down on how much I eat at each meal, so I'm eating more between meals to make up for it."
The only way I know of to allocate memory without using the Java heap would be to write JNI calls to malloc the memory with C. But then how would you use this memory? You'd have to write more JNI calls to interact with it. I think you'd end up basically re-inventing Java.
If the goal here is to send a large file, just use buffered streams and read/write it one byte at a time. A buffered stream, as the name implies, will take care of buffering for you so you're not really hitting the hard drive one byte at a time. It will actually read a larger chunk at a time (I think the default is 8k) and then pass these bytes to you as you ask for them. Likewise, on the write side it will save up a few KB and send them all in chunks.
So all you should have to do is open a BufferedInputStream and a BufferedOutputStream. Then write a loop that reads one byte from the input stream and writes it to the output stream until you hit end-of-file.
Something like:
OutputStream os = ... however you're getting your socket ...
BufferedInputStream bis = new BufferedInputStream(new FileInputStream(fileObject));
BufferedOutputStream bos = new BufferedOutputStream(os);
int b;
// Single-byte calls are cheap here: both streams batch the real I/O internally.
while ((b = bis.read()) != -1)
    bos.write(b);
bis.close();
bos.close();
No need to make life complicated for yourself by re-inventing buffering.
while (