How to implement a buffered / batched FileChannel in Java? - java

This does not look trivial, especially for a read/write buffered FileChannel. Is there an open-source implementation somewhere that I can base mine on?
To be clear for those who did not understand:
FileChannel does buffering at the OS level, and I want to do buffering at the Java level. Read here to understand: FileChannel#force and buffering
@Peter I want to write a huge file to disk from a fast message stream. Buffering and batching are the way to go, so I want to batch in Java and then call FileChannel.write.

I recommend using a BufferedOutputStream wrapping a FileOutputStream. I do not believe you will see any performance improvement by mucking with ByteBuffer and FileChannel, and that you'll be left with a lot of hard-to-maintain code if you go that route.
The reasoning is quite simple: regardless of the approach you take, the steps involved are the same:
1. Generate bytes. You don't say how you plan to do this, and it could introduce an additional level of temporary buffering into the equation. But regardless, the Java data has to be turned into bytes.
2. Accumulate bytes into a buffer. You want to buffer your data before writing it, so that you're not making lots of small writes. That's a given. But where that buffer lives is immaterial.
3. Move bytes from the Java heap to the C heap, across the JNI barrier. Writing a file is a native operation, and it doesn't read directly from the Java heap. So whether you buffer on the Java heap and then move the buffered bytes, or buffer in a direct ByteBuffer (and yes, you want a direct buffer), you're still moving the bytes. You will make more JNI calls with the ByteBuffer, but that's a marginal cost.
4. Invoke write, a system call that copies bytes from the C heap into a kernel-maintained disk buffer.
5. Write the kernel buffer to disk. This will outweigh all the other steps combined, because disks are slow.
There may be a few microseconds gained or lost depending on exactly how you implement these steps, but you can't change the basic steps.
The FileChannel does give you the option to call force(), to ensure that step #5 actually happens. This is likely to actually decrease your overall performance, as the underlying fsync call will not return until the bytes are written. And if you really want to do it, you can always get the channel from the underlying stream.
Bottom line: I'm willing to bet that you're actually IO-bound, and there's no cure for that save better hardware.
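As a rough illustration of the recommendation above, here is a minimal sketch of a BufferedOutputStream wrapping a FileOutputStream, with an optional force() through the stream's channel. The class name BufferedFileWriter, the method writeAll, and its parameters are hypothetical names chosen for this example; they are not from the original answer.

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Path;

final class BufferedFileWriter {
    // Hypothetical helper: writes every message through one Java-side buffer
    // and returns the total number of bytes handed to the stream.
    static long writeAll(Path target, Iterable<byte[]> messages, int bufferSize, boolean sync)
            throws IOException {
        long total = 0;
        try (FileOutputStream fos = new FileOutputStream(target.toFile());
             BufferedOutputStream out = new BufferedOutputStream(fos, bufferSize)) {
            for (byte[] message : messages) {
                out.write(message);          // accumulates in the Java-side buffer
                total += message.length;
            }
            out.flush();                     // hand any remaining buffered bytes to the OS
            if (sync) {
                fos.getChannel().force(true); // block until the OS reports the data as written
            }
        }
        return total;
    }
}
```

Passing sync = false skips the force() call, which is probably what you want unless durability matters more than throughput.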

FileChannel only works with ByteBuffers, so it is naturally buffered. If you need additional buffering you can copy data from ByteBuffer to ByteBuffer, but I am not sure why you would want to.
FileChannel does buffering in the OS level
FileChannel does tell the OS what to do. The OS usually has a disk cache but FileChannel has no idea whether this is the case or not.
I want to do buffering in the Java level
You are in luck, because you don't have a choice. ;) This is the only option.

I would have two threads: the producer thread produces ByteBuffers and appends them to the tail of a queue, and the consumer thread removes some ByteBuffers from the head of the queue each time and calls fileChannel.write(ByteBuffer[]).
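A minimal sketch of that producer/consumer idea, assuming a LinkedBlockingQueue as the queue and a zero-length sentinel buffer to signal the end of the stream. The class BatchingWriter and its methods submit, finish and runConsumer are hypothetical names for illustration only.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class BatchingWriter {
    // Sentinel compared by identity; an assumed convention for "no more data".
    private static final ByteBuffer STOP = ByteBuffer.allocate(0);
    private final BlockingQueue<ByteBuffer> queue = new LinkedBlockingQueue<>();

    // Producer side: the buffer must already be flipped (ready to be read).
    void submit(ByteBuffer readyToRead) { queue.add(readyToRead); }

    void finish() { queue.add(STOP); }

    // Consumer side: wait for one buffer, take whatever else is queued,
    // and write the whole batch with a single gathering write.
    void runConsumer(Path target) throws IOException, InterruptedException {
        try (FileChannel ch = FileChannel.open(target, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            List<ByteBuffer> batch = new ArrayList<>();
            boolean done = false;
            while (!done) {
                batch.clear();
                batch.add(queue.take());
                queue.drainTo(batch);
                int end = batch.size();
                for (int i = 0; i < batch.size(); i++) {
                    if (batch.get(i) == STOP) { end = i; done = true; break; }
                }
                ByteBuffer[] srcs = batch.subList(0, end).toArray(new ByteBuffer[0]);
                while (hasRemaining(srcs)) {
                    ch.write(srcs); // may write only part of the batch, so loop
                }
            }
        }
    }

    private static boolean hasRemaining(ByteBuffer[] srcs) {
        for (ByteBuffer b : srcs) {
            if (b.hasRemaining()) return true;
        }
        return false;
    }
}
```

In real use, runConsumer would run on its own thread while one or more producers call submit; the batch size then adapts naturally to how far the consumer has fallen behind.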

Related

difference between read(ByteBuffer) and read(byte[]) in FileChannel and FileInputStream

I am new to NIO, and I found an article saying 'the block-based transmission is commonly more effective than stream-based transmission'.
It means read(ByteBuffer) is block-based transmission and read(byte[]) is stream-based transmission.
I want to know what's the internal difference between the two methods.
PS: I have also heard that block-based transmission transfers byte arrays and stream-based transmission transfers bytes one by one. I think that's wrong,
because java.io.FileInputStream.read(byte[]) transfers byte array as well.
One thing that makes ByteBuffer more efficient is using direct memory. This avoids a copy from direct memory into a byte[]. If you are merely copying data from one Channel to another this can be up to 30% faster. If you are reading byte by byte it can be slightly slower to use a ByteBuffer, as it has more overhead accessing each byte. If you use it to read binary values, e.g. int or double, it can be much faster as it can grab the whole value in one access.
I think you're talking about buffer-based vs stream-based I/O operations. Java NIO is buffer oriented in the sense that data is first read into a buffer which is then processed. This gives one flexibility. Also, you need to be sure that the buffer has all the data you require before you process it. On the other hand, with stream-based I/O, you read one or more bytes from the stream, and these are not cached anywhere. Stream I/O is blocking, while NIO channels can also be used in non-blocking mode (this applies to selectable channels such as sockets; FileChannel itself is always blocking).
While I wouldn't use "stream-based" to characterize read(byte[]), there are efficiency gains to a ByteBuffer over a byte[] in some cases.
See A simple rule of when I should use direct buffers with Java NIO for network I/O? and ByteBuffer.allocate() vs. ByteBuffer.allocateDirect()
The memory backing a ByteBuffer can be (if "direct") easier for the JVM to pass to the OS and to do IO tricks with (for example, passing the memory directly to read or write calls), and may not be "on" the JVM's heap. The memory backing a byte[] is on the JVM heap, and IO generally does not go directly into the memory used by the array (instead it often goes through a bounce buffer, because the GC may "move" array objects around in memory while IO is pending, or the array memory may not be contiguous).
However, if you have to manipulate the data in Java, a ByteBuffer may not make much difference, as you'll eventually have to copy the data into the Java heap to manipulate it. If you're doing a data copy in and back out without manipulation, a direct ByteBuffer can be a win.
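The "copy in and back out without manipulation" case might look like the following sketch, which moves data between two channels through a direct ByteBuffer. ChannelCopy and copy are hypothetical names, and the buffer size is left to the caller.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;

final class ChannelCopy {
    // Copies everything from in to out through one direct buffer
    // and returns the number of bytes written.
    static long copy(ReadableByteChannel in, WritableByteChannel out, int bufferSize)
            throws IOException {
        ByteBuffer buf = ByteBuffer.allocateDirect(bufferSize); // lives outside the Java heap
        long total = 0;
        while (in.read(buf) != -1) {
            buf.flip();                     // switch from filling to draining
            while (buf.hasRemaining()) {
                total += out.write(buf);    // write() may be partial, so loop
            }
            buf.clear();                    // ready for the next read
        }
        return total;
    }
}
```

The data never has to be copied into a byte[] on the Java heap here, which is where a direct buffer has a chance to pay off.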

Java BufferedOutputStream: How many bytes to write

This is more like a matter of conscience than a technological issue :p
I'm writing some Java code to download files from a server. For that, I'm using the BufferedOutputStream method write() and the BufferedInputStream method read().
So my question is: if I use a buffer to hold the bytes, how many bytes should I read at a time? Sure, I can read byte by byte using just int b = read() and then write(b), or I could use a buffer. If I take the second approach, are there any aspects that I must pay attention to when defining the number of bytes to read/write each time? What will this number affect in my program?
Thanks
Unless you have a really fast network connection, the size of the buffer will make little difference. I'd say that 4k buffers would be fine, though there's no harm in using buffers a bit bigger.
The same probably applies to using read() versus read(byte[]) ... assuming that you are using a BufferedInputStream.
Unless you have an extraordinarily fast / low-latency network connection, the bottleneck is going to be the data rate that the network and your computers' network interfaces can sustain. For a typical internet connection, the application can move the data two or more orders of magnitude faster than the network can. So unless you do something silly (like doing 1 byte reads on an unbuffered stream), your Java code won't be the bottleneck.
BufferedInputStream and BufferedOutputStream typically rely on System.arraycopy for their implementations. System.arraycopy has a native implementation, which likely relies on memmove or bcopy. The amount of memory that is copied will depend on the available space in your buffer, but regardless, the implementation down to the native code is pretty efficient, unlikely to affect the performance of your application regardless of how many bytes you are reading/writing.
However, with respect to BufferedInputStream, if you set a mark with a high limit, a new internal buffer may need to be created. If you do use a mark, reading more bytes than are available in the old buffer may cause a temporary performance hit, though the amortized performance is still linear.
As Stephen C mentioned, you are more likely to see performance issues due to the network.
What is the MTU (maximum transmission unit) of your network connection? If you are using UDP, for example, you can check this value and use a smaller byte array. If this doesn't matter, check how much memory your program uses. I think 1024 to 4096 bytes is a good size for holding this data while you continue to receive.
If you pump data you normally do not need any Buffered streams. Just make sure you pass a decently sized (8-64k) temporary byte[] buffer to the read method (or use a pump method which does it). The default buffer size is too small for most usages (and if you read with a larger temp array, the stream's internal buffer is bypassed anyway).
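A "pump method" of the kind mentioned above could be sketched like this. StreamPump and pump are hypothetical names; since Java 9, InputStream.transferTo(OutputStream) does a similar job with a buffer size chosen by the JDK.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class StreamPump {
    // Copies in to out using one reusable byte[] and returns the byte count.
    static long pump(InputStream in, OutputStream out, int bufferSize) throws IOException {
        byte[] buf = new byte[bufferSize];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);   // write only the bytes actually read
            total += n;
        }
        return total;
    }
}
```

A call such as StreamPump.pump(in, out, 16 * 1024) falls inside the 8-64k range suggested above.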

flush() java file handling

What is the exact use of flush()? What is the difference between stream and buffer? Why do we need buffer?
The advantage of buffering is efficiency. It is generally faster to write a block of 4096 bytes one time to a file than to write, say, one byte 4096 times.
The disadvantage of buffering is that you miss out on the feedback. Output to a handle can remain in memory until enough bytes are written to make it worthwhile to write to the file handle. One part of your program may write some data to a file, but a different part of the program or a different program can't access that data until the first part of your program copies the data from memory to disk. Depending on how quickly data is being written to that file, this can take an arbitrarily long time.
When you call flush(), you are asking the OS to immediately write out whatever data is in the buffer to the file handle, even if the buffer is not full.
The data sometimes gets cached in a buffer before it's actually written to disk. flush causes what's in the buffer to be written to disk.
flush tells an output stream to send all the data to the underlying stream. It's necessary because of internal buffering. The essential purpose of a buffer is to minimize calls to the underlying stream's APIs. If I'm storing a long byte array to a FileOutputStream, I don't want Java to call the operating system file API once per byte. Thus, buffers are used at various stages, both inside and outside Java. Even if you did call fputc once per byte, the OS wouldn't really write to disk each time, because it has its own buffering.
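To make the effect of flush() visible, the following sketch writes fewer bytes than the buffer holds and compares the file size before and after flushing. FlushDemo and sizesAroundFlush are hypothetical names, and the 8192-byte buffer size is just an assumption for the example.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class FlushDemo {
    // Returns {file size before flush, file size after flush}.
    static long[] sizesAroundFlush(Path file) throws IOException {
        try (BufferedOutputStream out =
                 new BufferedOutputStream(Files.newOutputStream(file), 8192)) {
            out.write(new byte[100]);        // smaller than the buffer, so it stays in memory
            long before = Files.size(file);  // nothing has reached the file yet
            out.flush();                     // push the buffered bytes to the underlying stream
            long after = Files.size(file);
            return new long[] { before, after };
        }
    }
}
```

Note that flush() only moves the data out of the Java-side buffer; the OS may still hold it in its own cache for a while.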

How to allocate the memory from OS instead of increasing the JVM’s heap size?

I need to detect whether the file I am attaching to an email is exceeding the server limit. I am not allowed to increase the JVM heap size to do this since it is going to affect the application performance.
If I don’t increase the JVM heap size, I will run into OutOfMemoryError directly.
I would like to know how to allocate the memory from the OS instead of increasing the JVM’s heap size.
Thanks a lot!
Are you really trying to read the entire file to determine its size to check if it is less than some configured value (your question is not too easy to understand)? If so, why are you not using File#length() instead?
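If a size check is all that is needed, a sketch using File#length() could be as small as this. AttachmentCheck, exceedsLimit and the maxBytes parameter are hypothetical; the actual server limit would have to come from your configuration.

```java
import java.io.File;

final class AttachmentCheck {
    // True if the file on disk is larger than the allowed size.
    // File#length() asks the file system; it does not read the file contents.
    static boolean exceedsLimit(File attachment, long maxBytes) {
        return attachment.length() > maxBytes;
    }
}
```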
If you need to stream the file to the server in order to find out whether it's too big, you still don't need to read the whole file into memory.
Instead, read maybe 10-100k into memory. Fill the buffer, send it to the server. Repeat until the file is done or the server complains. Then you don't need enough memory for the whole file.
If you write your own stream handling code, you could create your own counter to track the number of bytes transmitted. I'd be surprised if there wasn't already some sort of Filter class that does this for you. Sun has a page about this. Search for 'CountReader'.
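A byte counter of the kind described above can be sketched as a small FilterOutputStream subclass. CountingOutputStream and getCount are hypothetical names for this example, not an existing JDK class.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

final class CountingOutputStream extends FilterOutputStream {
    private long count;

    CountingOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        count++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len);   // pass the whole chunk through in one call
        count += len;
    }

    // Number of bytes passed to the wrapped stream so far.
    long getCount() {
        return count;
    }
}
```

Wrapping the socket's output stream in this class lets the sending loop stop as soon as getCount() passes the limit, without ever holding the whole file in memory.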
You could allocate the memory natively via native code and JNI. However that sounds a painful way to do this.
Instead can't you give the JVM suitable memory configurations (via -Xmx) ? If the document you're mailing is so big that you can't easily handle it, then I'm not sure email is the correct medium to transfer it, and you should instead host it and send a link to it, or perhaps FTP it.
If all the other solutions turn out to be unusable (and I would encourage you to find a better way than requiring the entire file to fit in memory!) you could consider using a direct ByteBuffer. It has the option of using mmap() or other system calls to map a file into your memory without actually reading / allocating space in the heap. You can do this by calling map() on a FileChannel -- API documentation. Note that this is potentially expensive and/or not supported on some platforms, so it should be considered suboptimal compared to any solution which does not require the entire file to be in memory.
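For reference, a minimal sketch of mapping a file with FileChannel.map() follows. MappedRead and sumBytes are hypothetical names; the byte sum is only there to show the mapped contents being accessed. A single mapping cannot exceed Integer.MAX_VALUE bytes, so this only works as written for files under 2 GB.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedRead {
    // Maps the whole file read-only and sums its bytes (as unsigned values).
    static long sumBytes(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            long sum = 0;
            while (map.hasRemaining()) {
                sum += map.get() & 0xFF;   // pages are loaded by the OS on access
            }
            return sum;
        }
    }
}
```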
// Copy the file to the server in 4 KB chunks; the whole file is never held in memory.
Socket s = /* go get your socket to the server */;
try (InputStream is = new FileInputStream("foo.txt");
     OutputStream os = s.getOutputStream()) {
    byte[] buf = new byte[4096];
    int len;
    while ((len = is.read(buf)) != -1) {
        os.write(buf, 0, len);
    }
}
Of course handle your Exceptions.
If you're not allowed to increase the heap size because of memory constraints, doing an "under the table" memory allocation would cause the same problems. It sounds like you're looking for a loophole in the rules. Like, "My doctor says to cut down on how much I eat at each meal, so I'm eating more between meals to make up for it."
The only way I know of to allocate memory without using the Java heap would be to write JNI calls to malloc the memory with C. But then how would you use this memory? You'd have to write more JNI calls to interact with it. I think you'd end up basically re-inventing Java.
If the goal here is to send a large file, just use buffered streams and read/write it one byte at a time. A buffered stream, as the name implies, will take care of buffering for you so you're not really hitting the hard drive one byte at a time. It will really read a block at a time (I think the default is 8k) and then pass these bytes to you as you ask for them. Likewise, on the write side it will save up a few kB and send them all in chunks.
So all you should have to do is open a BufferedInputStream and a BufferedOutputStream. Then write a loop that reads one byte from the input stream and writes it to the output stream until you hit end-of-file.
Something like:
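OutputStream os = /* however you're getting your socket */;
BufferedInputStream bis = new BufferedInputStream(new FileInputStream(fileObject));
BufferedOutputStream bos = new BufferedOutputStream(os);
int b;
// Each read()/write() call hits the in-memory buffer, not the disk or the network.
while ((b = bis.read()) != -1) {
    bos.write(b);
}
bis.close();
bos.close(); // close() also flushes whatever is still buffered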
No need to make life complicated for yourself by re-inventing buffering.

Buffer a large file; BufferedInputStream limited to 2gb; Arrays limited to 2^31 bytes

I am sequentially processing a large file and I'd like to keep a large chunk of it in memory, 16gb ram available on a 64 bit system.
A quick and dirty way to do this is to simply wrap the input stream in a buffered input stream. Unfortunately, this only gives me a 2gb buffer. I'd like to have more of it in memory; what alternatives do I have?
How about letting the OS deal with the buffering of the file? Have you checked what the performance impact of not copying the whole file into the JVM's memory is?
EDIT: You could then use either RandomAccessFile or the FileChannel to efficiently read the necessary parts of the file into the JVMs memory.
Have you considered the MappedByteBuffer in java.nio? It's over my head but maybe it is what you are looking for.
I doubt that buffering more than 2gb at a time is going to be a huge win anyway. Depending on the amount of processing you're doing, you might be able to read in nearly as fast as you process. To speed it up, you might try using a two-threaded producer-consumer model (one thread reads the file and hands the data off to the other thread for processing).
The OS is going to cache as much of the file as it can, so trying to outsmart the cache manager probably isn't going to get you very much.
From a performance perspective, you will be much better served by keeping the bytes outside the JVM (transferring huge chunks of data between the OS and JVM is relatively slow). You can achieve this goal by using a MappedByteBuffer backed by a direct memory block.
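Because a single mapping is limited to Integer.MAX_VALUE bytes, one way to cover a larger file is to map it as a series of windows. The sketch below assumes read-only access; ChunkedMapping, mapInChunks and the chunkSize parameter are hypothetical names for illustration.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;

final class ChunkedMapping {
    // Maps the whole file as consecutive read-only windows of at most chunkSize bytes.
    static List<MappedByteBuffer> mapInChunks(FileChannel ch, long chunkSize)
            throws IOException {
        if (chunkSize <= 0 || chunkSize > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("chunkSize must be in 1..Integer.MAX_VALUE");
        }
        List<MappedByteBuffer> chunks = new ArrayList<>();
        long size = ch.size();
        for (long pos = 0; pos < size; pos += chunkSize) {
            long len = Math.min(chunkSize, size - pos);   // last window may be shorter
            chunks.add(ch.map(FileChannel.MapMode.READ_ONLY, pos, len));
        }
        return chunks;
    }
}
```

The mapped pages are managed by the OS page cache rather than the Java heap, so how much of the file actually stays in RAM is still decided by the OS.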
Here's a pertinent how-to type of article: article
I think there are 64 bit JVMs that will support nonstandard limits.
You might try buffering chunks.
