flush() java file handling - java

What is the exact use of flush()? What is the difference between stream and buffer? Why do we need buffer?

The advantage of buffering is efficiency. It is generally faster to write a block of 4096 bytes one time to a file than to write, say, one byte 4096 times.
The disadvantage of buffering is that you miss out on the feedback. Output to a handle can remain in memory until enough bytes are written to make it worthwhile to write to the file handle. One part of your program may write some data to a file, but a different part of the program or a different program can't access that data until the first part of your program copies the data from memory to disk. Depending on how quickly data is being written to that file, this can take an arbitrarily long time.
When you call flush(), you are asking the stream to immediately pass whatever data is in its buffer on to the underlying file handle, even if the buffer is not full.

The data sometimes gets cached in a buffer before it's actually written to disk. flush() causes what's in the buffer to be written out.

flush tells an output stream to send all the data to the underlying stream. It's necessary because of internal buffering. The essential purpose of a buffer is to minimize calls to the underlying stream's APIs. If I'm storing a long byte array to a FileOutputStream, I don't want Java to call the operating system file API once per byte. Thus, buffers are used at various stages, both inside and outside Java. Even if you did call fputc once per byte, the OS wouldn't really write to disk each time, because it has its own buffering.
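To make the effect of flush() concrete, here is a minimal sketch. The class name FlushDemo, the temp file and the five-byte payload are illustrative assumptions. It writes fewer bytes than BufferedOutputStream's default buffer holds and checks the file size before and after flush():

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class FlushDemo {
    /** Returns {file size before flush(), file size after flush()}. */
    static long[] sizesAroundFlush() {
        try {
            Path file = Files.createTempFile("flush-demo", ".txt");
            try (OutputStream out = new BufferedOutputStream(new FileOutputStream(file.toFile()))) {
                out.write("hello".getBytes(StandardCharsets.US_ASCII));
                long before = Files.size(file); // the bytes still sit in the Java-side buffer
                out.flush();                    // hand them to the FileOutputStream, and so to the OS
                long after = Files.size(file);
                return new long[] {before, after};
            } finally {
                Files.deleteIfExists(file);     // runs after the stream has been closed
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] sizes = sizesAroundFlush();
        System.out.println("before flush: " + sizes[0] + " bytes, after flush: " + sizes[1] + " bytes");
    }
}
```

Note that flush() only moves the data out of the Java buffer. The OS may still hold it in its own cache before it reaches the physical disk.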

Related

Okio vs java.io performance

I read the following blog:
https://medium.com/@jerzy.chalupski/a-closer-look-at-the-okio-library-90336e37261
It says:
"the Sinks and Sources are often connected into a pipe. Smart folks at Square realized that there’s no need to copy the data between such pipe components like the java.io buffered streams do. All Sources and Sinks use Buffers under the hood, and Buffers keep the data in Segments, so quite often you can just take an entire Segment from one Buffer and move it to another."
I just don't understand where the copying of data happens in java.io,
or in which case a Segment would be moved to another Buffer.
I read the source code of Okio. If I write strings to a file with Okio like the following:
val sink = logFile.appendingSink().buffer()
sink.writeUtf8("xxxx")
there will be no "moving segment to another Buffer". Am I right?
Java's BufferedReader is just a Reader that buffers data into a buffer – the buffer being a char[], or something like that – so that every time you need a bunch of bytes/chars from it, it doesn't need to read bytes from a file/network/whatever your byte source is (as long as it has buffered enough bytes). A BufferedWriter does the opposite operation: whenever you write a bunch of bytes to the BufferedWriter, it doesn't actually write bytes to a file/socket/whatever, but it "parks" them into a buffer, so that it can flush the buffer only once it's full.
Overall, this minimises access to file/network/whatever, as it could be expensive.
When you pipe a BufferedReader to a BufferedWriter, you effectively have 2 buffers. How does Java move bytes from one buffer to the other? It copies them from the source to the sink using System.arraycopy (or something equivalent). Everything works well, except that copying a bunch of bytes takes an amount of time that grows linearly with the size of the buffer(s). Hence, copying 1 MB will take roughly 1000 times longer than copying 1 KB.
Okio, on the other hand, doesn't actually copy bytes. Oversimplifying the way it works, Okio has a single byte[] with the actual bytes, and the only thing that gets moved from the source to the sink is the pointer (or reference) to that byte[], which requires the same amount of time regardless of its size.
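The difference between copying and moving can be sketched with a toy model. ToyBuffer below is a hypothetical stand-in invented for this illustration, not Okio's actual Buffer or Segment class; it only shows why handing over a reference costs the same regardless of how many bytes the array holds:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class SegmentMoveSketch {
    /** Hypothetical buffer made of byte[] segments (an illustration, not Okio's real class). */
    static final class ToyBuffer {
        final Deque<byte[]> segments = new ArrayDeque<>();

        /** Copy-style transfer: allocate a new array and copy every byte, O(n). */
        void copyFrom(ToyBuffer source) {
            byte[] segment = source.segments.removeFirst();
            byte[] copy = new byte[segment.length];
            System.arraycopy(segment, 0, copy, 0, segment.length);
            segments.addLast(copy);
        }

        /** Move-style transfer: relink the very same array, O(1) whatever its size. */
        void moveFrom(ToyBuffer source) {
            segments.addLast(source.segments.removeFirst());
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[8192];
        ToyBuffer source = new ToyBuffer();
        ToyBuffer sink = new ToyBuffer();
        source.segments.addLast(data);
        sink.moveFrom(source);
        System.out.println("same array after move: " + (sink.segments.peekLast() == data));
    }
}
```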

nature of input/output streams in a servlet

I'm able to extract contents from the request object through an input stream. So if it is a stream, does it mean the data is being transferred 'live' from the client to the servlet through os->webcontainer->etc etc ?
If I pass large amount of data in the request, does it get cached somewhere at the OS/JVM or is it being read directly from the source live? Can I open a request inputStream to tera/peta bytes of data, and write it to an outputstream byte by byte without any problems (ignoring the amount of time it would take and time outs) ?
Update: if they are getting cached, why are they streams, which can be read only once after being opened (and so need to be stored)? They should instead be available to be read as many times as needed.
Just random queries, no practical use.
They're not cached. If something's been cached, it's available for re-use. Those streams are however not reusable and thus definitely not cached.
However, it's quite possible that they're buffered in memory, or even on the local disk file system instead of memory. This is entirely up to the server implementation and even the underlying operating system (which may page memory out to what is known as a "virtual disk" or "swap disk", depending on the operating system used). This buffer is, however, usually nowhere near the order of megabytes. For example, the standard Java SE BufferedInputStream class has a default internal buffer of 8KB.
Can I open a request inputStream to tera/peta bytes of data, and write it to an outputstream byte by byte without any problems (ignoring the amount of time it would take and time outs) ?
You may hit the HTTP POST size limit, which is usually configurable on the server. In Tomcat, for example, this defaults to 2MB and can be disabled entirely; note that it applies to request bodies the container parses as form parameters. See also the maxPostSize setting on the HTTP connector.

How to implement a buffered / batched FileChannel in Java?

This does not look trivial, especially for a read/write buffered FileChannel. Is there anything open source implemented somewhere that I can base my implementation on?
To be clear for those who did not understand:
FileChannel does buffering at the OS level and I want to do buffering at the Java level. Read here to understand: FileChannel#force and buffering
@Peter I want to write a huge file to disk from a fast message stream. Buffering and batching are the way to go. So I want to batch in Java and then call FileChannel.write.
I recommend using a BufferedOutputStream wrapping a FileOutputStream. I do not believe you will see any performance improvement by mucking with ByteBuffer and FileChannel, and that you'll be left with a lot of hard-to-maintain code if you go that route.
The reasoning is quite simple: regardless of the approach you take, the steps involved are the same:
1) Generate bytes. You don't say how you plan to do this, and it could introduce an additional level of temporary buffering into the equation. But regardless, the Java data has to be turned into bytes.
2) Accumulate bytes into a buffer. You want to buffer your data before writing it, so that you're not making lots of small writes. That's a given. But where that buffer lives is immaterial.
3) Move bytes from Java heap to C heap, across the JNI barrier. Writing a file is a native operation, and it doesn't read directly from the Java heap. So whether you buffer on the Java heap and then move the buffered bytes, or buffer in a direct ByteBuffer (and yes, you want a direct buffer), you're still moving the bytes. You will make more JNI calls with the ByteBuffer, but that's a marginal cost.
4) Invoke write, a kernel call that copies bytes from the C heap into a kernel-maintained disk buffer.
5) Write the kernel buffer to disk. This will outweigh all the other steps combined, because disks are slow.
There may be a few microseconds gained or lost depending on exactly how you implement these steps, but you can't change the basic steps.
The FileChannel does give you the option to call force(), to ensure that step #5 actually happens. This is likely to actually decrease your overall performance, as the underlying fsync call will not return until the bytes are written. And if you really want to do it, you can always get the channel from the underlying stream.
Bottom line: I'm willing to bet that you're actually IO-bound, and there's no cure for that save better hardware.
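The recommended approach can be sketched as follows. This is a minimal illustration, not a tuned implementation: the class name, the 64 KB buffer size, the "message\n" payload and the temp file are all assumptions made for the example. It also shows how to reach force() through the underlying stream's channel:

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class BufferedFileWrite {
    /** Writes `count` small messages through a Java-side buffer; returns the resulting file size. */
    static long writeToTempFile(int count, boolean forceToDisk) {
        try {
            Path file = Files.createTempFile("buffered-write", ".log");
            try (FileOutputStream fos = new FileOutputStream(file.toFile());
                 BufferedOutputStream out = new BufferedOutputStream(fos, 64 * 1024)) {
                byte[] message = "message\n".getBytes(StandardCharsets.US_ASCII);
                for (int i = 0; i < count; i++) {
                    out.write(message);          // accumulates in the Java buffer (step 2)
                }
                out.flush();                     // pushes buffered bytes to the OS (steps 3 and 4)
                if (forceToDisk) {
                    fos.getChannel().force(true); // blocks until the OS reports the data as written (step 5)
                }
                return fos.getChannel().size();
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("bytes written: " + writeToTempFile(1000, true));
    }
}
```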
FileChannel only works with ByteBuffers so it is naturally buffered. If you need additional buffering you can copy data from ByteBuffer to ByteBuffer, but I am not sure why you would want to.
FileChannel does buffering at the OS level
FileChannel does tell the OS what to do. The OS usually has a disk cache but FileChannel has no idea whether this is the case or not.
I want to do buffering at the Java level
You are in luck, because you don't have a choice. ;) This is the only option.
I would have two threads: the producer thread produces ByteBuffers and appends them to the tail of a queue, while the consumer thread removes some ByteBuffers from the head of the queue each time and calls fileChannel.write(ByteBuffer[]).
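A rough sketch of that two-thread design might look like the following. The queue type, the batch size of 64, the empty-buffer end marker (POISON) and the message format are assumptions chosen for the illustration, not part of the suggestion itself:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class BatchedChannelWriter {
    // Hypothetical end-of-stream marker, compared by identity only.
    private static final ByteBuffer POISON = ByteBuffer.allocate(0);

    private static boolean hasRemaining(ByteBuffer[] buffers) {
        for (ByteBuffer b : buffers) {
            if (b.hasRemaining()) return true;
        }
        return false;
    }

    /** Produces `messages` buffers on one thread, writes them in batches on the caller's thread. */
    static long run(int messages) {
        try {
            Path file = Files.createTempFile("batched", ".bin");
            BlockingQueue<ByteBuffer> queue = new LinkedBlockingQueue<>();
            Thread producer = new Thread(() -> {
                for (int i = 0; i < messages; i++) {
                    queue.add(ByteBuffer.wrap(("msg " + i + "\n").getBytes(StandardCharsets.US_ASCII)));
                }
                queue.add(POISON); // always the last element put on the queue
            });
            producer.start();
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
                List<ByteBuffer> batch = new ArrayList<>();
                boolean done = false;
                while (!done) {
                    batch.clear();
                    batch.add(queue.take());   // block until at least one buffer is available
                    queue.drainTo(batch, 63);  // then take whatever else is already queued
                    if (batch.get(batch.size() - 1) == POISON) {
                        batch.remove(batch.size() - 1);
                        done = true;
                    }
                    ByteBuffer[] buffers = batch.toArray(new ByteBuffer[0]);
                    while (hasRemaining(buffers)) {
                        channel.write(buffers); // a gathering write may be partial, so loop
                    }
                }
                producer.join();
                return channel.size();
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("file size: " + run(100) + " bytes");
    }
}
```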

Java BufferedOutputStream: How many bytes to write

This is more like a matter of conscience than a technological issue :p
I'm writing some Java code to download files from a server. For that, I'm using the BufferedOutputStream method write() and the BufferedInputStream method read().
So my question is: if I use a buffer to hold the bytes, how many bytes should I read at a time? Sure, I can read byte by byte using just int b = read() and then write(b), or I could use a buffer. If I take the second approach, are there any aspects I must pay attention to when defining the number of bytes to read/write each time? What will this number affect in my program?
Thanks
Unless you have a really fast network connection, the size of the buffer will make little difference. I'd say that 4k buffers would be fine, though there's no harm in using buffers a bit bigger.
The same probably applies to using read() versus read(byte[]) ... assuming that you are using a BufferedInputStream.
Unless you have an extraordinarily fast / low-latency network connection, the bottleneck is going to be the data rate that the network and your computers' network interfaces can sustain. For a typical internet connection, the application can move the data two or more orders of magnitude faster than the network can. So unless you do something silly (like doing 1 byte reads on an unbuffered stream), your Java code won't be the bottleneck.
BufferedInputStream and BufferedOutputStream typically rely on System.arraycopy for their implementations. System.arraycopy has a native implementation, which likely relies on memmove or bcopy. The amount of memory that is copied will depend on the available space in your buffer, but regardless, the implementation down to the native code is pretty efficient, unlikely to affect the performance of your application regardless of how many bytes you are reading/writing.
However, with respect to BufferedInputStream, if you set a mark with a high limit, a new internal buffer may need to be created. If you do use a mark, reading more bytes than are available in the old buffer may cause a temporary performance hit, though the amortized performance is still linear.
As Stephen C mentioned, you are more likely to see performance issues due to the network.
What is the MTU (maximum transmission unit) of your network connection? If you are using UDP, for example, you can check this value and use a smaller array of bytes. If this does not matter, you need to check how much memory your program consumes. I think 1024 to 4096 bytes will be a good choice for holding this data while you continue to receive.
If you pump data you normally do not need to use any Buffered streams. Just make sure you use a decently sized (8-64k) temporary byte[] buffer passed to the read method (or use a pump method which does it). The default buffer size is too small for most usages (and if you pass a temp array larger than the stream's internal buffer, that internal buffer is bypassed anyway).
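A pump loop of the kind described might look like this minimal sketch. The class name and the 16 KB buffer size are illustrative choices within the suggested 8-64k range:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

public class StreamPump {
    /** Copies everything from `in` to `out` through one reusable array; returns the byte count. */
    static long pump(InputStream in, OutputStream out) {
        byte[] buffer = new byte[16 * 1024];
        long total = 0;
        try {
            int n;
            // read(byte[]) returns how many bytes it actually delivered, or -1 at end of stream
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n); // write only the part of the array that was filled
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = pump(new ByteArrayInputStream(new byte[100_000]), sink);
        System.out.println("copied " + copied + " bytes");
    }
}
```

Memory use stays constant at the size of the temporary array, no matter how much data flows through.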

Is FileInputStream using buffers already?

When I am using FileInputStream to read an object (say a few bytes), does the underlying operation involve:
1) Reading a whole block of disk so that if I subsequently do another read operation, it wouldn't require a real disk read as that portion of the file was already fetched in the last read operation?
OR
2) A new disk access to take place because FileInputStream does not do any buffering and bufferedInputStream should have been used instead to achieve the effect of (1)?
I think that since FileInputStream uses the read system call and it reads only a set of pages from hard disk, some buffering must be taking place.
FileInputStream will make an underlying native system call. Most OSes will do their own buffering for this. So it does not need a real disk seek for each byte. But still, you have the cost of making the native OS call, which is expensive. So BufferedInputStream would be preferable. However, for reading small amounts of data (like you say, a few bytes or even kBs), either one should be fine as the number of OS calls won't be that different.
Looking at the native code for FileInputStream, it doesn't look like there is any buffering going on in there. The OS buffering may kick in, but there's no explicit indicator one way or another if/when that happens.
One thing to look out for is reading from a mounted network volume over a slow connection. I ran into a big performance issue using a non-buffered FileInputStream for this. Didn't catch it in development, because the file system was local.
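To see how much a BufferedInputStream reduces the number of calls to the underlying stream, here is a small sketch. CountingInputStream is a hypothetical wrapper written for this example, and a ByteArrayInputStream stands in for the FileInputStream; with a real file, each counted call would be a native read:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class ReadCallCounter {
    /** Hypothetical wrapper that counts how often the wrapped stream is asked for data. */
    static final class CountingInputStream extends FilterInputStream {
        int calls;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            calls++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return super.read(b, off, len);
        }
    }

    /** Reads `size` bytes one at a time and reports how many calls reached the source. */
    static int callsWhenReadingByteByByte(int size, boolean buffered) {
        CountingInputStream source = new CountingInputStream(new ByteArrayInputStream(new byte[size]));
        try (InputStream in = buffered ? new BufferedInputStream(source) : source) {
            while (in.read() != -1) {
                // discard the data; only the call count matters here
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return source.calls;
    }

    public static void main(String[] args) {
        System.out.println("unbuffered calls: " + callsWhenReadingByteByByte(10_000, false));
        System.out.println("buffered calls:   " + callsWhenReadingByteByByte(10_000, true));
    }
}
```

The unbuffered run asks the source once per byte, while the buffered run asks only once per buffer refill.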
