This is more a matter of conscience than a technological issue :p
I'm writing some Java code to download files from a server... For that, I'm using the BufferedOutputStream method write() and the BufferedInputStream method read().
So my question is: if I use a buffer to hold the bytes, what should the number of bytes to read be? Sure, I can read byte by byte using just int b = read() and then write(b), or I could use a buffer. If I take the second approach, are there any aspects I must pay attention to when defining the number of bytes to read/write each time? What will this number affect in my program?
Thanks
Unless you have a really fast network connection, the size of the buffer will make little difference. I'd say that 4k buffers would be fine, though there's no harm in using buffers a bit bigger.
The same probably applies to using read() versus read(byte[]) ... assuming that you are using a BufferedInputStream.
Unless you have an extraordinarily fast / low-latency network connection, the bottleneck is going to be the data rate that the network and your computers' network interfaces can sustain. For a typical internet connection, the application can move the data two or more orders of magnitude faster than the network can. So unless you do something silly (like doing 1 byte reads on an unbuffered stream), your Java code won't be the bottleneck.
BufferedInputStream and BufferedOutputStream typically rely on System.arraycopy for their implementations. System.arraycopy has a native implementation, which likely relies on memmove or bcopy. The amount of memory that is copied will depend on the available space in your buffer, but regardless, the implementation down to the native code is pretty efficient, unlikely to affect the performance of your application regardless of how many bytes you are reading/writing.
However, with respect to BufferedInputStream, if you set a mark with a high limit, a new internal buffer may need to be created. If you do use a mark, reading more bytes than are available in the old buffer may cause a temporary performance hit, though the amortized performance is still linear.
As Stephen C mentioned, you are more likely to see performance issues due to the network.
What is the MTU (maximum transmission unit) of your network connection? If you are using UDP, for example, you can check this value and size your byte array accordingly. If that doesn't matter, check how much memory your program consumes. I think 1024 - 4096 bytes is a good range for holding the data while you continue to receive.
If you pump data you normally do not need to use any Buffered streams. Just make sure you use a decently sized (8-64k) temporary byte[] buffer passed to the read method (or use a pump method which does it, as sketched below). The default buffer size is too small for most uses (and if you pass a larger temporary array, the stream's internal buffer is effectively bypassed anyway).
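For reference, a minimal sketch of such a pump loop; the method name and the 16 KB buffer size are just illustrative choices inside the range mentioned above:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class StreamPump {
    // Copies everything from in to out through one reusable temporary buffer.
    static void pump(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[16 * 1024];   // 16 KB; anything in the 8-64k range is reasonable
        int len;
        while ((len = in.read(buf)) != -1) {
            out.write(buf, 0, len);
        }
        out.flush();
    }
}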
Related
I am new to NIO and I found an article saying 'block-based transmission is commonly more effective than stream-based transmission'.
It means read(ByteBuffer) is block-based transmission and read(byte[]) is stream-based transmission.
I want to know what's the internal difference between the two methods.
PS: I have also heard that block-based transmission means transferring byte arrays and stream-based transmission means transferring bytes one by one. I think that's wrong,
because java.io.FileInputStream.read(byte[]) transfers a byte array as well.
One thing that makes a ByteBuffer more efficient is using direct memory. This avoids a copy from direct memory into a byte[]. If you are merely copying data from one Channel to another this can be up to 30% faster. If you are reading byte by byte it can be slightly slower to use a ByteBuffer, as it has more overhead accessing each byte. If you use it to read binary data such as int or double it can be much faster, as it can grab the whole value in one access.
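For illustration, a minimal sketch of a channel-to-channel copy through a single direct buffer; the class name and the 64 KB size are arbitrary:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;

final class ChannelCopy {
    // Copies from one channel to another through one direct buffer, so the
    // data never has to be pulled into a byte[] on the Java heap.
    static void copy(ReadableByteChannel src, WritableByteChannel dst) throws IOException {
        ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024);
        while (src.read(buf) != -1) {
            buf.flip();                    // switch from filling to draining
            while (buf.hasRemaining()) {
                dst.write(buf);
            }
            buf.clear();                   // ready to be filled again
        }
    }
}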
I think you're talking about buffer-based vs stream-based I/O operations. Java NIO is buffer oriented in the sense that data is first read into a buffer which is then processed. This gives one flexibility. Also, you need to be sure that the buffer has all the data you require before you process it. On the other hand, with stream-based I/O, you read one or more bytes from the stream and these are not cached anywhere. Stream-based I/O is blocking, while buffer-based I/O (which is what Java NIO offers) can be non-blocking.
While I wouldn't use "stream-based" to characterize read(byte[]), there are efficiency gains to a ByteBuffer over a byte[] in some cases.
See A simple rule of when I should use direct buffers with Java NIO for network I/O? and ByteBuffer.allocate() vs. ByteBuffer.allocateDirect()
The memory backing a ByteBuffer can be (if "direct") easier for the JVM to pass to the OS and to do IO tricks with (for example passing the memory directly to read and write calls), and may not be "on" the JVM's heap. The memory backing a byte[] is on the JVM heap and IO generally does not go directly into the memory used by the array (instead it often goes through a bounce buffer, because the GC may "move" array objects around in memory while IO is pending or the array memory may not be contiguous).
However, if you have to manipulate the data in Java, a ByteBuffer may not make much difference, as you'll eventually have to copy the data into the Java heap to manipulate it. If you're doing a data copy in and back out without manipulation, a direct ByteBuffer can be a win.
This does not look trivial, especially for a read/write buffered FileChannel. Is there anything open source implemented somewhere that I can base my implementation on?
To be clear for those who did not understand:
FileChannel does buffering at the OS level and I want to do buffering at the Java level. Read here to understand: FileChannel#force and buffering
@Peter I want to write a huge file to disk from a fast message stream. Buffering and batching are the way to go. So I want to batch in Java and then call FileChannel.write.
I recommend using a BufferedOutputStream wrapping a FileOutputStream. I do not believe you will see any performance improvement by mucking with ByteBuffer and FileChannel, and that you'll be left with a lot of hard-to-maintain code if you go that route.
The reasoning is quite simple: regardless of the approach you take, the steps involved are the same:
Generate bytes. You don't say how you plan to do this, and it could introduce an additional level of temporary buffering into the equation. But regardless, the Java data has to be turned into bytes.
Accumulate bytes into a buffer. You want to buffer your data before writing it, so that you're not making lots of small writes. That's a given. But where that buffer lives is immaterial.
Move bytes from Java heap to C heap, across JNI barrier. Writing a file is a native operation, and it doesn't read directly from the Java heap. So whether you buffer on the Java heap and then move the buffered bytes, or buffer in a direct ByteBuffer (and yes, you want a direct buffer), you're still moving the bytes. You will make more JNI calls with the ByteBuffer, but that's a marginal cost.
Invoke write(), the system call that copies bytes from the C heap into a kernel-maintained disk buffer.
Write the kernel buffer to disk. This will outweigh all the other steps combined, because disks are slow.
There may be a few microseconds gained or lost depending on exactly how you implement these steps, but you can't change the basic steps.
The FileChannel does give you the option to call force(), to ensure that step #5 actually happens. This is likely to actually decrease your overall performance, as the underlying fsync call will not return until the bytes are written. And if you really want to do it, you can always get the channel from the underlying stream.
Bottom line: I'm willing to bet that you're actually IO-bound, and there's no cure for that save better hardware.
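For reference, a minimal sketch of the recommended setup; the file name and the 64 KB buffer size are just illustrative values:

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

final class BufferedFileWrite {
    static void writeAll(byte[] data) throws IOException {
        try (OutputStream out = new BufferedOutputStream(
                new FileOutputStream("out.bin"), 64 * 1024)) {
            out.write(data);   // small writes are accumulated in the buffer
        }                      // close() flushes whatever is left
    }
}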
FileChannel only works with ByteBuffers, so it is naturally buffered. If you need additional buffering you can copy data from ByteBuffer to ByteBuffer, but I am not sure why you would want to.
FileChannel does buffering at the OS level
FileChannel does tell the OS what to do. The OS usually has a disk cache but FileChannel has no idea whether this is the case or not.
I want to do buffering at the Java level
You are in luck, because you don't have a choice. ;) This is the only option.
I would have two threads: the producer thread produces ByteBuffers and appends them to the tail of a queue; the consumer thread removes some ByteBuffers from the head of the queue each time and calls fileChannel.write(ByteBuffer[]).
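A rough sketch of that arrangement, assuming a BlockingQueue carries the filled buffers between the two threads (the class name, queue capacity, and batching are made up for illustration):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class WriteBehind {
    private final BlockingQueue<ByteBuffer> queue = new ArrayBlockingQueue<>(64);

    // Producer side: hand a filled (and flipped) buffer to the writer thread.
    void submit(ByteBuffer filled) throws InterruptedException {
        queue.put(filled);
    }

    // Consumer side: drain whatever has accumulated and write it in one gathering call.
    void drainTo(FileChannel channel) throws IOException, InterruptedException {
        List<ByteBuffer> batch = new ArrayList<>();
        batch.add(queue.take());          // block until at least one buffer is available
        queue.drainTo(batch);             // grab the rest without blocking
        ByteBuffer[] srcs = batch.toArray(new ByteBuffer[0]);
        while (srcs[srcs.length - 1].hasRemaining()) {
            channel.write(srcs);          // gathering write of the whole batch
        }
    }
}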
What is the exact use of flush()? What is the difference between stream and buffer? Why do we need buffer?
The advantage of buffering is efficiency. It is generally faster to write a block of 4096 bytes one time to a file than to write, say, one byte 4096 times.
The disadvantage of buffering is that you miss out on the feedback. Output to a handle can remain in memory until enough bytes are written to make it worthwhile to write to the file handle. One part of your program may write some data to a file, but a different part of the program or a different program can't access that data until the first part of your program copies the data from memory to disk. Depending on how quickly data is being written to that file, this can take an arbitrarily long time.
When you call flush(), you are asking the OS to immediately write out whatever data is in the buffer to the file handle, even if the buffer is not full.
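For example, a small sketch; the file name and the message are placeholders:

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class FlushDemo {
    public static void main(String[] args) throws IOException {
        try (BufferedOutputStream log = new BufferedOutputStream(new FileOutputStream("app.log"))) {
            log.write("request received\n".getBytes(StandardCharsets.UTF_8));
            log.flush();   // push the bytes to the OS now, instead of waiting for the
                           // buffer to fill up or the stream to be closed
            // ... long-running work; another process can already see the line ...
        }
    }
}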
The data sometimes gets cached in a buffer before it's actually written to disk; flush() causes whatever is in the buffer to be written to disk.
flush tells an output stream to send all the data to the underlying stream. It's necessary because of internal buffering. The essential purpose of a buffer is to minimize calls to the underlying stream's APIs. If I'm storing a long byte array to a FileOutputStream, I don't want Java to call the operating system file API once per byte. Thus, buffers are used at various stages, both inside and outside Java. Even if you did call fputc once per byte, the OS wouldn't really write to disk each time, because it has its own buffering.
I need to detect whether the file I am attaching to an email is exceeding the server limit. I am not allowed to increase the JVM heap size to do this since it is going to affect the application performance.
If I don’t increase the JVM heap size, I will run into OutOfMemoryError directly.
I would like to know how to allocate memory from the OS instead of increasing the JVM's heap size.
Thanks a lot!
Are you really trying to read the entire file to determine its size to check if it is less than some configured value (your question is not too easy to understand)? If so, why are you not using File#length() instead?
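For example, a tiny sketch of that check; the file name and the 10 MB limit are placeholders:

import java.io.File;

final class AttachmentCheck {
    static void check(File attachment) {
        long maxBytes = 10L * 1024 * 1024;   // whatever the server's limit is
        if (attachment.length() > maxBytes) {
            throw new IllegalArgumentException("Attachment exceeds the server limit");
        }
    }
}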
If you need to stream the file to the server in order to find out whether it's too big, you still don't need to read the whole file into memory.
Instead, read maybe 10-100k into memory. Fill the buffer, send it to the server. Repeat until the file is done or the server complains. Then you don't need enough memory for the whole file.
If you write your own stream handling code, you could create your own counter to track the number of bytes transmitted. I'd be surprised if there wasn't already some sort of Filter class that does this for you. Sun has a page about this. Search for 'CountReader'.
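A minimal sketch of such a filter, if you end up rolling it yourself (the class name here is made up, though libraries such as Commons IO ship something similar):

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

final class CountingInputStream extends FilterInputStream {
    private long count;

    CountingInputStream(InputStream in) { super(in); }

    @Override public int read() throws IOException {
        int b = super.read();
        if (b != -1) count++;            // count single-byte reads
        return b;
    }

    @Override public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) count += n;           // count bulk reads
        return n;
    }

    long getCount() { return count; }
}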
You could allocate the memory natively via native code and JNI. However that sounds a painful way to do this.
Instead can't you give the JVM suitable memory configurations (via -Xmx) ? If the document you're mailing is so big that you can't easily handle it, then I'm not sure email is the correct medium to transfer it, and you should instead host it and send a link to it, or perhaps FTP it.
If all the other solutions turn out to be unusable (and I would encourage you to find a better way than requiring the entire file to fit in memory!) you could consider using a direct ByteBuffer. It has the option of using mmap() or other system calls to map a file into your memory without actually reading / allocating space in the heap. You can do this by calling map() on a FileChannel -- API documentation. Note that this is potentially expensive and/or not supported on some platforms, so it should be considered suboptimal compared to any solution which does not require the entire file to be in memory.
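A minimal sketch of mapping a file this way; the read-only mode and the path handling are illustrative:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

final class MapFile {
    // Maps the whole file into memory without allocating heap space for its contents.
    static MappedByteBuffer map(String path) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r");
             FileChannel channel = raf.getChannel()) {
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }
}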
Socket s = /* go get your socket to the server */;
InputStream is = new FileInputStream("foo.txt");
OutputStream os = s.getOutputStream();

byte[] buf = new byte[4096];
int len;
while ((len = is.read(buf)) != -1) {
    os.write(buf, 0, len);
}

os.close();
is.close();
Of course handle your Exceptions.
If you're not allowed to increase the heap size because of memory constraints, doing an "under the table" memory allocation would cause the same problems. It sounds like you're looking for a loophole in the rules. Like, "My doctor says to cut down on how much I eat at each meal, so I'm eating more between meals to make up for it."
The only way I know of to allocate memory without using the Java heap would be to write JNI calls to malloc the memory with C. But then how would you use this memory? You'd have to write more JNI calls to interact with it. I think you'd end up basically re-inventing Java.
If the goal here is to send a large file, just use buffered streams and read/write it one byte at a time. A buffered stream, as the name implies, will take care of buffering for you so you're not really hitting the hard drive one byte at a time. It will really read, I think the default is 8k at a time, and then pass these bytes to you as you ask for them. Likewise, on the write side it will save up a few kb and send them all in chunks.
So all you should have to do is open a BufferedInputStream and a BufferedOutputStream. Then write a loop that reads one byte from the input stream and writes it to the output stream until you hit end-of-file.
Something like:
OutputStream os = /* ... however you're getting your socket ... */;
BufferedInputStream bis = new BufferedInputStream(new FileInputStream(fileObject));
BufferedOutputStream bos = new BufferedOutputStream(os);

int b;
while ((b = bis.read()) != -1) {
    bos.write(b);
}

bis.close();
bos.close();
No need to make life complicated for yourself by re-inventing buffering.
I'm using ByteBuffers and FileChannels to write binary data to a file. When doing that for big files or successively for multiple files, I get an OutOfMemoryError exception.
I've read elsewhere that using ByteBuffers with NIO is broken and should be avoided. Have any of you already faced this kind of problem and found a solution to efficiently save large amounts of binary data in a file in Java?
Is the jvm option -XX:MaxDirectMemorySize the way to go?
I would say don't create a huge ByteBuffer that contains ALL of the data at once. Create a much smaller ByteBuffer, fill it with data, then write this data to the FileChannel. Then reset the ByteBuffer and continue until all the data is written.
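A rough sketch of that pattern; the record type (longs) and the 256 KB buffer size are made up for illustration:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

final class ChunkedWriter {
    private final ByteBuffer buf = ByteBuffer.allocate(256 * 1024);   // small, reused buffer

    // Append one value; drain the buffer to the channel whenever it fills up.
    void append(FileChannel out, long value) throws IOException {
        if (buf.remaining() < Long.BYTES) {
            drain(out);
        }
        buf.putLong(value);
    }

    // Write whatever has accumulated, then reset the buffer for reuse.
    void drain(FileChannel out) throws IOException {
        buf.flip();
        while (buf.hasRemaining()) {
            out.write(buf);
        }
        buf.clear();
    }
}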
Check out Java's Mapped Byte Buffers, also known as 'direct buffers'. Basically, this mechanism uses the OS's virtual memory paging system to 'map' your buffer directly to disk. The OS will manage moving the bytes to/from disk and memory auto-magically, very quickly, and you won't have to worry about changing your virtual machine options. This will also allow you to take advantage of NIO's improved performance over traditional java stream-based i/o, without any weird hacks.
The only two catches that I can think of are:
On a 32-bit system, you are limited to just under 4GB total for all mapped byte buffers. (That is actually a limit for my application, and I now run on 64-bit architectures.)
Implementation is JVM specific and not a requirement. I use Sun's JVM and there are no problems, but YMMV.
Kirk Pepperdine (a somewhat famous Java performance guru) is involved with a website, www.JavaPerformanceTuning.com, that has some more MBB details: NIO Performance Tips
If you access files in a random fashion (read here, skip, write there, move back) then you have a problem ;-)
But if you only write big files, you should seriously consider using streams. java.io.FileOutputStream can be used directly to write a file byte after byte, or it can be wrapped in any other stream (e.g. DataOutputStream, ObjectOutputStream) for the convenience of writing floats, ints, Strings or even serializable objects. Similar classes exist for reading files.
Streams offer you the convenience of manipulating arbitrarily large files in (almost) arbitrarily small memory. They are the preferred way of accessing the file system in the vast majority of cases.
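For instance, a hedged sketch of writing primitives through a wrapped stream; the method, file path, and data layout are made up:

import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

final class StreamWrite {
    // Writes primitives to a file through a buffered stream; memory use stays
    // constant no matter how large the file grows.
    static void writeSamples(String path, double[] samples) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(path)))) {
            out.writeInt(samples.length);
            for (double s : samples) {
                out.writeDouble(s);
            }
        }
    }
}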
Using the transferFrom method should help with this, assuming you write to the channel incrementally and not all at once as previous answers also point out.
This can depend on the particular JDK vendor and version.
There is a bug in the GC in some Sun JVMs. Shortages of direct memory will not trigger a GC in the main heap, but the direct memory is pinned down by garbage direct ByteBuffers in the main heap. If the main heap is mostly empty they may not be collected for a long time.
This can burn you even if you aren't using direct buffers on your own, because the JVM may be creating direct buffers on your behalf. For instance, writing a non-direct ByteBuffer to a SocketChannel creates a direct buffer under the covers to use for the actual I/O operation.
The workaround is to use a small number of direct buffers yourself, and keep them around for reuse.
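A minimal sketch of that workaround; the buffer size, the chunking, and the class name are illustrative:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

final class DirectBufferReuse {
    // One direct buffer, allocated once and reused for every write, so the JVM
    // does not have to create throwaway direct buffers behind the scenes.
    private final ByteBuffer direct = ByteBuffer.allocateDirect(64 * 1024);

    void send(SocketChannel channel, byte[] data) throws IOException {
        int offset = 0;
        while (offset < data.length) {
            direct.clear();
            int chunk = Math.min(direct.remaining(), data.length - offset);
            direct.put(data, offset, chunk);
            direct.flip();
            while (direct.hasRemaining()) {
                channel.write(direct);
            }
            offset += chunk;
        }
    }
}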
The previous two responses seem pretty reasonable. As for whether the command line switch will work, it depends on how quickly your memory usage hits the limit. If you don't have enough RAM and virtual memory available to at least triple the memory available, then you will need to use one of the alternate suggestions given.