difference between read(ByteBuffer) and read(byte[]) in FileChannel and FileInputStream - java

I am new to NIO, and I came across an article saying that 'block-based transmission is commonly more effective than stream-based transmission'.
It suggests that read(ByteBuffer) is block-based transmission and read(byte[]) is stream-based transmission.
I want to know what the internal difference between the two methods is.

P.S. I have also heard that block-based transmission transfers byte arrays while stream-based transmission transfers bytes one by one. I think that's wrong,
because java.io.FileInputStream.read(byte[]) transfers a byte array as well.

One thing that makes a ByteBuffer more efficient is using direct memory. This avoids a copy from direct memory into a byte[]. If you are merely copying data from one Channel to another, this can be up to 30% faster. If you are reading byte by byte, it can be slightly slower to use a ByteBuffer, as it has more overhead for accessing each byte. If you use it to read binary values, e.g. int or double, it can be much faster, as it can grab the whole value in one access.
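A sketch of that last point, pulling whole doubles out of a direct buffer (the file name, and the assumption that the file holds back-to-back 8-byte doubles, are mine):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ReadBinary {
    public static void main(String[] args) throws IOException {
        // hypothetical file of back-to-back 8-byte doubles
        try (FileChannel fc = FileChannel.open(Paths.get("data.bin"), StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocateDirect(8192);
            while (fc.read(buf) != -1) {
                buf.flip();                      // switch from filling to draining
                while (buf.remaining() >= 8) {
                    double d = buf.getDouble();  // whole value in one access
                    // ... process d ...
                }
                buf.compact();                   // keep any partial trailing value
            }
        }
    }
}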

I think you're talking about buffer-based vs stream-based I/O operations. Java NIO is buffer oriented in the sense that data is first read into a buffer which is then processed. This gives you flexibility, though you need to be sure that the buffer has all the data you require before you process it. With stream-based I/O, on the other hand, you read one or more bytes from the stream, and these are not cached anywhere. Stream I/O is always blocking, whereas NIO also supports non-blocking mode; note that non-blocking mode applies to selectable channels such as sockets, while FileChannel itself is always blocking.

While I wouldn't use "stream-based" to characterize read(byte[]), there are efficiency gains to a ByteBuffer over a byte[] in some cases.
See A simple rule of when I should use direct buffers with Java NIO for network I/O? and ByteBuffer.allocate() vs. ByteBuffer.allocateDirect()
The memory backing a ByteBuffer can be (if "direct") easier for the JVM to pass to the OS and to do IO tricks with (for example, passing the memory directly to read and write calls), and may not be "on" the JVM's heap. The memory backing a byte[] is on the JVM heap, and IO generally does not go directly into the memory used by the array (instead it often goes through a bounce buffer, because the GC may "move" array objects around in memory while IO is pending, or the array memory may not be contiguous).
However, if you have to manipulate the data in Java, a ByteBuffer may not make much difference, as you'll eventually have to copy the data into the Java heap to manipulate it. If you're doing a data copy in and back out without manipulation, a direct ByteBuffer can be a win.
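For example, a copy loop that keeps the bytes off the Java heap entirely might look like this (paths and buffer size are placeholders):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ChannelCopy {
    public static void main(String[] args) throws IOException {
        try (FileChannel in = FileChannel.open(Paths.get("in.bin"), StandardOpenOption.READ);
             FileChannel out = FileChannel.open(Paths.get("out.bin"),
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024); // off-heap, so no bounce buffer
            while (in.read(buf) != -1) {
                buf.flip();
                while (buf.hasRemaining()) {
                    out.write(buf); // write may not drain the buffer in one call
                }
                buf.clear();
            }
        }
    }
}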

JVM : When does JVM need to copy memory content

I just read a wiki here; one of the passages said:
Although theoretically these are general-purpose data structures, the
implementation may select memory for alignment or paging
characteristics, which are not otherwise accessible in Java.
Typically, this would be used to allow the buffer contents to occupy
the same physical memory used by the underlying operating system for
its native I/O operations, thus allowing the most direct transfer
mechanism, and eliminating the need for any additional copying.
I am curious about the words "eliminating the need for any additional copying": when does the JVM need this extra copy, and why can NIO avoid it?
It's talking about a direct mapping between a kernel data structure and a user-space data structure; normally data must be copied when it crosses that boundary. With NIO and a direct buffer, however, that extra copy between kernel space and the Java heap does not occur: the kernel can read into or write from the buffer's memory directly.
From java.nio package API:
A byte buffer can be allocated as a direct buffer, in which case the Java virtual machine will make a best effort to perform native I/O operations directly upon it.
Example:
// assuming a file opened for reading, e.g.:
FileChannel fc = FileChannel.open(Paths.get("data.bin"), StandardOpenOption.READ);
ByteBuffer buf = ByteBuffer.allocateDirect(8192); // off-heap memory
int n = fc.read(buf); // best effort: the OS may fill the buffer's memory directly
Simply put, the old IO path always copies data from the kernel into a buffer on the Java heap. NIO lets you use buffers that the kernel can fill or map directly for file/network streams. The result: less memory consumption and far better performance.
Many developers know only a single JVM, the Oracle HotSpot JVM, and speak of garbage collection in general when they are referring to Oracle's HotSpot implementation specifically. With that caveat in mind, see Bob's post:
New input/output (NIO) library, introduced with JDK 1.4, provides high-speed, block-oriented I/O in standard Java code.
A few points on NIO:
- IO is stream oriented, whereas NIO is buffer oriented.
- NIO offers non-blocking I/O operations.
- NIO avoids an extra copy of data passed between Java and native memory.
- NIO allows reading and writing blocks of data directly from disk, rather than byte by byte.
The NIO API introduces a new primitive I/O abstraction called a channel. A channel represents an open connection to an entity such as a hardware device, a file, or a network socket.
When you use the APIs FileChannel.transferTo() or FileChannel.transferFrom(), the JVM can use the OS's access to DMA (Direct Memory Access), which is a potential advantage.
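A hedged sketch of that API (file names invented; note that transferTo may move fewer bytes than requested, hence the loop):

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class TransferCopy {
    public static void main(String[] args) throws IOException {
        try (FileChannel src = FileChannel.open(Paths.get("src.bin"), StandardOpenOption.READ);
             FileChannel dst = FileChannel.open(Paths.get("dst.bin"),
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            long pos = 0;
            long size = src.size();
            while (pos < size) {
                // the kernel can move these bytes without surfacing them in Java
                pos += src.transferTo(pos, size - pos, dst);
            }
        }
    }
}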
According to Ron Hitchens in Java NIO:
Direct buffers are intended for interaction with channels and native
I/O routines. They make a best effort to store the byte elements in a
memory area that a channel can use for direct, or raw, access by using
native code to tell the operating system to drain or fill the memory
area directly.
Direct byte buffers are usually the best choice for I/O operations. By
design, they support the most efficient I/O mechanism available to the
JVM. Nondirect byte buffers can be passed to channels, but doing so
may incur a performance penalty. It's usually not possible for a
nondirect buffer to be the target of a native I/O operation.
Direct buffers are optimal for I/O, but they may be more expensive to
create than nondirect byte buffers. The memory used by direct buffers
is allocated by calling through to native, operating system-specific
code, bypassing the standard JVM heap. Setting up and tearing down
direct buffers could be significantly more expensive than
heap-resident buffers, depending on the host operating system and JVM
implementation. The memory-storage areas of direct buffers are not
subject to garbage collection because they are outside the standard
JVM heap.
Chapter 2 of the tutorial below will give you more insight (especially sections 2.4, 2.4.2, etc.):
http://blogimg.chinaunix.net/blog/upfile2/090901134800.pdf

How to implement a buffered / batched FileChannel in Java?

This does not look trivial, especially for a read/write buffered FileChannel. Is there anything open source implemented somewhere that I can base my implementation on?
To be clear, for those who did not understand:
FileChannel does buffering at the OS level and I want to do buffering at the Java level. Read here to understand: FileChannel#force and buffering
@Peter: I want to write a huge file to disk from a fast message stream. Buffering and batching are the way to go. So I want to batch in Java and then call FileChannel.write.
I recommend using a BufferedOutputStream wrapping a FileOutputStream. I do not believe you will see any performance improvement by mucking with ByteBuffer and FileChannel, and that you'll be left with a lot of hard-to-maintain code if you go that route.
The reasoning is quite simple: regardless of the approach you take, the steps involved are the same:
1. Generate bytes. You don't say how you plan to do this, and it could introduce an additional level of temporary buffering into the equation. But regardless, the Java data has to be turned into bytes.
2. Accumulate bytes into a buffer. You want to buffer your data before writing it, so that you're not making lots of small writes. That's a given. But where that buffer lives is immaterial.
3. Move bytes from the Java heap to the C heap, across the JNI barrier. Writing a file is a native operation, and it doesn't read directly from the Java heap. So whether you buffer on the Java heap and then move the buffered bytes, or buffer in a direct ByteBuffer (and yes, you want a direct buffer), you're still moving the bytes. You will make more JNI calls with the ByteBuffer, but that's a marginal cost.
4. Invoke the write system call, which copies bytes from the C heap into a kernel-maintained disk buffer.
5. Write the kernel buffer to disk. This will outweigh all the other steps combined, because disks are slow.
There may be a few microseconds gained or lost depending on exactly how you implement these steps, but you can't change the basic steps.
The FileChannel does give you the option to call force(), to ensure that step #5 actually happens. This is likely to actually decrease your overall performance, as the underlying fsync call will not return until the bytes are written. And if you really want to do it, you can always get the channel from the underlying stream.
Bottom line: I'm willing to bet that you're actually IO-bound, and there's no cure for that save better hardware.
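For reference, the recommended setup is only a few lines (file name, buffer size, and payload are placeholders):

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class BufferedWrite {
    public static void main(String[] args) throws IOException {
        // 64 KB buffer: small writes accumulate here and hit the disk in large chunks
        try (OutputStream out = new BufferedOutputStream(
                new FileOutputStream("huge-file.bin"), 64 * 1024)) {
            byte[] message = "example message\n".getBytes();
            for (int i = 0; i < 1_000_000; i++) {
                out.write(message); // no system call per write
            }
        } // close() flushes whatever is still buffered
    }
}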
FileChannel only works with ByteBuffers, so it is naturally buffered. If you need additional buffering, you can copy data from ByteBuffer to ByteBuffer, but I am not sure why you would want to.
"FileChannel does buffering at the OS level"
FileChannel tells the OS what to do. The OS usually has a disk cache, but FileChannel has no idea whether this is the case or not.
"I want to do buffering at the Java level"
You are in luck, because you don't have a choice. ;) This is the only option.
I would have two threads: the producer thread produces ByteBuffers and appends them to the tail of a queue; the consumer thread removes some ByteBuffers from the head of the queue each time and calls fileChannel.write(ByteBuffer[]), as in the sketch below.
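A minimal sketch of that two-thread scheme (class and method names are my own):

import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueuedWriter {
    private final BlockingQueue<ByteBuffer> queue = new LinkedBlockingQueue<>();

    // producer thread: hand over a filled, flipped buffer
    public void submit(ByteBuffer buf) throws InterruptedException {
        queue.put(buf);
    }

    // consumer thread: drain a batch and write it with one gathering call
    public void drainTo(FileChannel fc) throws Exception {
        List<ByteBuffer> batch = new ArrayList<>();
        batch.add(queue.take());         // block until at least one buffer arrives
        queue.drainTo(batch);            // grab whatever else is already queued
        ByteBuffer[] bufs = batch.toArray(new ByteBuffer[0]);
        long remaining = 0;
        for (ByteBuffer b : bufs) {
            remaining += b.remaining();
        }
        while (remaining > 0) {
            remaining -= fc.write(bufs); // gathering write of the whole batch
        }
    }
}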

Java BufferedOutputStream: How many bytes to write

This is more a matter of conscience than a technological issue :p
I'm writing some Java code to download files from a server. For that, I'm using the BufferedOutputStream method write() and the BufferedInputStream method read().
So my question is: if I use a buffer to hold the bytes, what should the number of bytes to read be? Sure, I could read byte by byte using just int b = read() and then write(b), or I could use a buffer. If I take the second approach, are there any aspects I must pay attention to when choosing the number of bytes to read/write each time? What will this number affect in my program?
Thanks
Unless you have a really fast network connection, the size of the buffer will make little difference. I'd say that 4k buffers would be fine, though there's no harm in using buffers a bit bigger.
The same probably applies to using read() versus read(byte[]) ... assuming that you are using a BufferedInputStream.
Unless you have an extraordinarily fast / low-latency network connection, the bottleneck is going to be the data rate that the network and your computers' network interfaces can sustain. For a typical internet connection, the application can move the data two or more orders of magnitude faster than the network can. So unless you do something silly (like doing 1-byte reads on an unbuffered stream), your Java code won't be the bottleneck.
BufferedInputStream and BufferedOutputStream typically rely on System.arraycopy for their implementations. System.arraycopy has a native implementation, which likely relies on memmove or bcopy. The amount of memory that is copied will depend on the available space in your buffer, but regardless, the implementation down to the native code is pretty efficient, unlikely to affect the performance of your application regardless of how many bytes you are reading/writing.
However, with respect to BufferedInputStream, if you set a mark with a high limit, a new internal buffer may need to be created. If you do use a mark, reading more bytes than are available in the old buffer may cause a temporary performance hit, though the amortized performance is still linear.
As Stephen C mentioned, you are more likely to see performance issues due to the network.
What is the MTU (maximum transmission unit) of your network connection? If you are using UDP, for example, you can check this value and use a smaller array of bytes. If that doesn't matter, you should check how much memory your program consumes. I think 1024-4096 bytes is a good range for holding the data as you continue to receive.
If you pump data you normally do not need to use any Buffered streams. Just make sure you use a decently sized (8-64k) temporary byte[] buffer passed to the read method, or use a pump method which does it, as in the sketch below. The default buffer size is too small for most usages (and if you pass a larger temp array, the stream's internal buffer will be bypassed anyway).
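A minimal pump method under those assumptions (the names are invented; 16k is picked from the suggested 8-64k range):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class Pump {
    // copies everything from in to out; with a temp array this size,
    // wrapping the streams in Buffered* variants adds nothing
    static long pump(InputStream in, OutputStream out) throws IOException {
        byte[] tmp = new byte[16 * 1024];
        long total = 0;
        int n;
        while ((n = in.read(tmp)) != -1) {
            out.write(tmp, 0, n); // write only the bytes actually read
            total += n;
        }
        out.flush();
        return total;
    }
}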

Android: Most efficient way to pass some read-only bytes to native C++

I have an Android project (targeting Android 1.6 and up) which includes native code written in C/C++, accessed via NDK. I'm wondering what the most efficient way is to pass an array of bytes from Java through NDK to my JNI glue layer. My concern is around whether or not NDK for Android will copy the array of bytes, or just give me a direct reference. I need read-only access to the bytes at the C++ level, so any copying behind the scenes would be a waste of time from my perspective.
It's easy to find info about this on the web, but I'm not sure which of it is most pertinent. Examples:
Get the pointer of a Java ByteBuffer though JNI
http://www.milk.com/kodebase/dalvik-docs-mirror/docs/jni-tips.html
http://elliotth.blogspot.com/2007/03/optimizing-jni-array-access.html
So does anyone know what is the best (most efficient, least copying) way to do this in the current NDK? GetByteArrayRegion? GetByteArrayElements? Something else?
According to the documentation, GetDirectBufferAddress will give you the reference without copying the array.
However, to call this function you need to allocate a direct buffer with ByteBuffer.allocateDirect() instead of a simple byte array. That has a trade-off, as explained here:
A direct byte buffer may be created by invoking the allocateDirect
factory method of this class. The buffers returned by this method
typically have somewhat higher allocation and deallocation costs than
non-direct buffers. The contents of direct buffers may reside outside
of the normal garbage-collected heap, and so their impact upon the
memory footprint of an application might not be obvious. It is
therefore recommended that direct buffers be allocated primarily for
large, long-lived buffers that are subject to the underlying system's
native I/O operations. In general it is best to allocate direct
buffers only when they yield a measurable gain in program
performance.
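A minimal Java-side sketch of this approach (the library and method names are invented; the native side would call GetDirectBufferAddress on the received buffer to obtain a raw pointer without any copy):

import java.nio.ByteBuffer;

public class NativeBytes {
    static { System.loadLibrary("myglue"); } // hypothetical NDK library

    // implemented in C/C++; it receives the buffer as a jobject and calls
    // GetDirectBufferAddress(env, buffer) to read the bytes in place
    private static native void process(ByteBuffer buffer, int length);

    public static void send(byte[] data) {
        // one copy into a direct buffer; ideally the data is produced
        // straight into such a buffer so even this copy disappears
        ByteBuffer buf = ByteBuffer.allocateDirect(data.length);
        buf.put(data);
        buf.flip();
        process(buf, buf.remaining());
    }
}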

How to avoid OutOfMemoryError when using Bytebuffers and NIO?

I'm using ByteBuffers and FileChannels to write binary data to a file. When doing that for big files, or successively for multiple files, I get an OutOfMemoryError.
I've read elsewhere that using ByteBuffers with NIO is broken and should be avoided. Have any of you already faced this kind of problem and found a solution for efficiently saving large amounts of binary data to a file in Java?
Is the JVM option -XX:MaxDirectMemorySize the way to go?
I would say don't create a huge ByteBuffer that contains ALL of the data at once. Create a much smaller ByteBuffer, fill it with data, then write this data to the FileChannel. Then reset the ByteBuffer and continue until all the data is written, as in the sketch below.
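A sketch of that fill/write/reset loop (the chunk size is arbitrary, and the byte[] source stands in for whatever produces your data):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class ChunkedWriter {
    // writes data in 64 KB slices instead of one huge allocation
    static void writeAll(FileChannel fc, byte[] data) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(64 * 1024);
        int offset = 0;
        while (offset < data.length) {
            int chunk = Math.min(buf.capacity(), data.length - offset);
            buf.clear();                 // reset for refilling
            buf.put(data, offset, chunk);
            buf.flip();                  // switch to draining
            while (buf.hasRemaining()) {
                fc.write(buf);           // write may take several calls
            }
            offset += chunk;
        }
    }
}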
Check out Java's mapped byte buffers, a kind of direct buffer. Basically, this mechanism uses the OS's virtual memory paging system to 'map' your buffer directly to disk. The OS will manage moving the bytes to/from disk and memory auto-magically, very quickly, and you won't have to worry about changing your virtual machine options. This will also allow you to take advantage of NIO's improved performance over traditional Java stream-based I/O, without any weird hacks.
The only two catches that I can think of are:
On a 32-bit system, you are limited to just under 4GB total for all mapped byte buffers. (That is actually a limit for my application, and I now run on 64-bit architectures.)
Implementation is JVM specific and not a requirement. I use Sun's JVM and there are no problems, but YMMV.
Kirk Pepperdine (a somewhat famous Java performance guru) is involved with a website, www.JavaPerformanceTuning.com, that has some more MBB details: NIO Performance Tips
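A minimal mapped-buffer sketch (the file name and region size are invented):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedWrite {
    public static void main(String[] args) throws IOException {
        long size = 1L << 20; // 1 MB region; keep totals well under ~4GB on 32-bit
        try (FileChannel fc = FileChannel.open(Paths.get("big.bin"),
                StandardOpenOption.CREATE, StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            MappedByteBuffer map = fc.map(FileChannel.MapMode.READ_WRITE, 0, size);
            map.putLong(0, 42L); // the OS pages this out to disk for you
        }
    }
}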
If you access files in a random fashion (read here, skip, write there, move back) then you have a problem ;-)
But if you only write big files, you should seriously consider using streams. java.io.FileOutputStream can be used directly to write a file byte after byte, or wrapped in any other stream (e.g. DataOutputStream, ObjectOutputStream) for the convenience of writing floats, ints, Strings or even serializable objects. Similar classes exist for reading files.
Streams offer you the convenience of manipulating arbitrarily large files in (almost) arbitrarily small memory. They are the preferred way of accessing the file system in the vast majority of cases.
Using the transferFrom method should help with this, assuming you write to the channel incrementally and not all at once, as previous answers also point out.
This can depend on the particular JDK vendor and version.
There is a bug in the GC in some Sun JVMs. A shortage of direct memory will not trigger a GC in the main heap, even though the direct memory is pinned down by garbage direct ByteBuffers in the main heap. If the main heap is mostly empty, they may not be collected for a long time.
This can burn you even if you aren't using direct buffers on your own, because the JVM may be creating direct buffers on your behalf. For instance, writing a non-direct ByteBuffer to a SocketChannel creates a direct buffer under the covers to use for the actual I/O operation.
The workaround is to use a small number of direct buffers yourself, and keep them around for reuse, for example along the lines of the sketch below.
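One possible shape for that workaround (the per-thread scheme and the 64 KB size are my assumptions):

import java.nio.ByteBuffer;

public class DirectBufferPool {
    // one 64 KB direct buffer per thread, allocated once and reused,
    // so garbage direct buffers never pile up waiting for a heap GC
    private static final ThreadLocal<ByteBuffer> BUF =
            ThreadLocal.withInitial(() -> ByteBuffer.allocateDirect(64 * 1024));

    public static ByteBuffer get() {
        ByteBuffer buf = BUF.get();
        buf.clear(); // ready for reuse
        return buf;
    }
}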
The previous two responses seem pretty reasonable. As for whether the command-line switch will work, it depends on how quickly your memory usage hits the limit. If you don't have enough RAM and virtual memory available to at least triple the memory available, then you will need to use one of the alternative suggestions given.
