I am receiving an infinite stream of data as input and writing it to a file, "out.dat". At the same time, I have to process the data with a slow function, process_data(). What I would like is a thread that reads data from "out.dat" and processes it with process_data(). So, in the main thread, I receive the input data and write it to "out.dat"; in the second thread, I read the data back from "out.dat" and process it.
(The reason for not using an exchange buffer between the two threads is that it would have to be quite big, since the input stream is fast and process_data() is slow. Over time, such a buffer would grow too much.)
It seems that using FileChannel would be appropriate. In the documentation I read "The view of a file provided by an instance of this class is guaranteed to be consistent with other views of the same file provided by other instances in the same program.".
However, I am not sure since it does not specify what "consistent" really means. My need is to write data to "out.dat" and, later, from another thread, read (FileChannel#read) the same data again. Does "consistent" mean that once I write the data, say at position 100000, when I do read(byteBuffer, 100000) I get the same data?
Do I have to call FileChannel#force to force the writings to be saved on disk?
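For concreteness, here is a rough sketch of the setup I have in mind (process_data() and receiveInput() are placeholders for my actual code):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class StreamToFile {
    public static void main(String[] args) throws IOException {
        FileChannel writeChannel = FileChannel.open(Paths.get("out.dat"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        FileChannel readChannel = FileChannel.open(Paths.get("out.dat"),
                StandardOpenOption.READ);

        // Second thread: read from out.dat and process, lagging behind the writer.
        Thread processor = new Thread(() -> {
            ByteBuffer buf = ByteBuffer.allocate(64 * 1024);
            long readPos = 0;
            try {
                while (true) {
                    int n = readChannel.read(buf, readPos); // positional read
                    if (n > 0) {
                        readPos += n;
                        buf.flip();
                        processData(buf); // my slow processing function
                        buf.clear();
                    } else {
                        Thread.sleep(10); // no new data yet, wait for the writer
                    }
                }
            } catch (IOException | InterruptedException e) {
                e.printStackTrace();
            }
        });
        processor.start();

        // Main thread: receive the input stream and append it to out.dat.
        long writePos = 0;
        while (true) {
            ByteBuffer chunk = receiveInput(); // placeholder for the fast input stream
            while (chunk.hasRemaining()) {
                writePos += writeChannel.write(chunk, writePos);
            }
        }
    }

    private static void processData(ByteBuffer data) { /* slow processing */ }

    private static ByteBuffer receiveInput() { /* placeholder */ return ByteBuffer.allocate(0); }
}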
Thank you in advance.
Related
I'm trying to use a MappedByteBuffer to allow concurrent reads on a file by multiple threads with the following constraints:
File is too large to load into memory
Threads must be able to read asynchronously (it's a web app)
The file is never written to by any thread
Every thread will always know the exact offset and length of the bytes it needs to read (i.e., no "seeking" by the app itself).
According to the docs (https://docs.oracle.com/javase/8/docs/api/java/nio/Buffer.html), Buffers are not thread-safe, since they keep internal state (position, etc.). Is there a way to have concurrent random access to the file without loading it all into memory?
Although FileChannel is technically thread-safe, from the docs:
Where the file channel is obtained from an existing stream or random access file then the state of the file channel is intimately connected to that of the object whose getChannel method returned the channel. Changing the channel's position, whether explicitly or by reading or writing bytes, will change the file position of the originating object, and vice versa
So it would seem that it's simply synchronized. If I were to call new RandomAccessFile().getChannel().map() in each thread [edit: on every read], then doesn't that incur the I/O overhead on each read that MappedByteBuffers are supposed to avoid?
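To make that concrete, what I am describing would look roughly like this (the method and class names here are just for illustration; file, offset and length would come from the request handler):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class PerReadMapping {
    // Map the requested region on every read, in whatever thread handles the request.
    public static byte[] read(String file, long offset, int length) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r");
             FileChannel channel = raf.getChannel()) {
            MappedByteBuffer mapped =
                    channel.map(FileChannel.MapMode.READ_ONLY, offset, length);
            byte[] result = new byte[length];
            mapped.get(result);
            return result;
        }
    }
}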
Rather than using multiple threads for concurrent reads, I'd go with this approach (based on an example with a huge CSV file whose lines have to be sent concurrently via HTTP):
Reading a single file at multiple positions concurrently wouldn't let you go any faster (but it could slow you down considerably).
Instead of reading the file from multiple threads, read the file from a single thread, and parallelize the processing of these lines. A single thread should read your CSV line-by-line, and put each line in a queue. Multiple working threads should then take the next line from the queue, parse it, convert to a request, and process the request concurrently as needed. The splitting of the work would then be done by a single thread, ensuring that there are no missing lines or overlaps.
If you can read the file line by line, LineIterator from Commons IO is a memory-efficient possibility. If you have to work with chunks, your MappedByteBuffer seems to be a reasonable approach. For the queue, I'd use a blocking queue with a fixed capacity—such as ArrayBlockingQueue—to better control the memory usage (lines/chunks in queue + lines/chunks among workers = lines/chunks in memory).
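A minimal sketch of this pattern, assuming each line can be processed independently (the file name, queue capacity, and processRequest() are placeholders):

import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class SingleReaderManyWorkers {
    // Sentinel telling a worker to shut down; compared by identity.
    private static final String POISON = new String("POISON");

    public static void main(String[] args) throws Exception {
        // The fixed capacity bounds the memory used by queued lines.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000);
        int workerCount = 4;
        Thread[] workers = new Thread[workerCount];

        // Workers: take the next line, parse it, process the request.
        for (int i = 0; i < workerCount; i++) {
            workers[i] = new Thread(() -> {
                try {
                    String line;
                    while ((line = queue.take()) != POISON) {
                        processRequest(line); // placeholder: parse + send request
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }

        // Single reader thread splits the work: no missing lines, no overlaps.
        try (BufferedReader reader =
                Files.newBufferedReader(Paths.get("huge.csv"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                queue.put(line); // blocks when the queue is full, throttling the reader
            }
        }
        for (int i = 0; i < workerCount; i++) {
            queue.put(POISON); // one sentinel per worker
        }
        for (Thread worker : workers) {
            worker.join();
        }
    }

    private static void processRequest(String line) {
        // placeholder for the real per-line work
    }
}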
FileChannel supports a read operation without synchronization. It natively uses pread on Linux:
public abstract int read(ByteBuffer dst, long position) throws IOException
Here is the relevant part of the FileChannel documentation:
...Other operations, in particular those that take an explicit position, may proceed concurrently; whether they in fact do so is dependent upon the underlying implementation and is therefore unspecified.
It is pretty primitive in that it only returns how many bytes were read, which may be fewer than requested (see details here). But I think you can still make use of it given your assumption that "Every thread will always know the exact offset and length of bytes it needs to read".
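For example, here is a small helper built on the positional read; it loops because a single call may return fewer bytes than requested (the class and method names are just for illustration):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public final class PositionalReads {
    // Reads exactly `length` bytes starting at `offset`. Safe to call from
    // multiple threads on the same channel, since it never touches the
    // channel's shared position.
    public static ByteBuffer readFully(FileChannel channel, long offset, int length)
            throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(length);
        long position = offset;
        while (buffer.hasRemaining()) {
            int n = channel.read(buffer, position);
            if (n < 0) {
                throw new IOException("Unexpected end of file at position " + position);
            }
            position += n;
        }
        buffer.flip(); // ready for the caller to read
        return buffer;
    }
}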
In Java, the flush() method is used on streams. But I don't understand what the purpose of calling this method is:
fin.flush();
Can you give me some suggestions?
From the docs of the flush method:
Flushes the output stream and forces any buffered output bytes to be written out. The general contract of flush is that calling it is an indication that, if any bytes previously written have been buffered by the implementation of the output stream, such bytes should immediately be written to their intended destination.
The buffering is mainly done to improve the I/O performance. More on this can be read from this article: Tuning Java I/O Performance.
When you write data to a stream, it is not written immediately; it is buffered. So use flush() when you need to be sure that all the data in the buffer has been written out.
We need to be sure that all writes are completed before we close the stream, which is why flush() is called from close() in file/buffered writers.
But if you have a requirement that your writes be saved at some point before you close the stream, call flush() yourself.
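For example, a minimal sketch (the file name and contents are just illustrations):

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

public class FlushDemo {
    public static void main(String[] args) throws IOException {
        BufferedWriter writer = new BufferedWriter(new FileWriter("log.txt"));
        writer.write("first entry");
        writer.newLine();

        // At this point the text may still be sitting in the writer's buffer.
        // flush() pushes it out to the file without closing the stream,
        // so another process tailing log.txt can see it now.
        writer.flush();

        writer.write("second entry");
        writer.newLine();

        // close() flushes any remaining buffered data before releasing the file.
        writer.close();
    }
}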
When we write to a stream, the data is first stored in a temporary memory area called a buffer. When the buffer fills up, its contents are written out to their destination, freeing space in the buffer for new data. Calling flush() writes out whatever is currently in the buffer without waiting for it to fill.
When the buffer is full, everything buffered in it is written out to the disk. Buffers are used to avoid the overhead of many small writes.
In the BufferedWriter class in the Java libraries, there is a line:
private static int defaultCharBufferSize = 8192;
If you do want to send data before the buffer is full, you do have control: just flush it. A call to writer.flush() says, "send whatever's in the buffer, now!"
Reference: Head First Java, page 453 (https://www.amazon.com/Head-First-Java-Kathy-Sierra/dp/0596009208)
In addition to other good answers here, this explanation made it very clear for me:
A buffer is a portion of memory that is used to store a stream of data (characters). These characters sometimes only get sent to an output device (e.g. a monitor) when the buffer is full or reaches a certain number of characters. This can cause your system to lag if you just have a few characters to send to an output device. The flush() method will immediately flush the contents of the buffer to the output stream.
Source: https://www.youtube.com/watch?v=MjK3dZTc0Lg
Streams are often accessed by threads that periodically empty their content and, for example, display it on the screen, send it to a socket or write it to a file. This is done for performance reasons. Flushing an output stream means that you want to stop, wait for the content of the stream to be completely transferred to its destination, and then resume execution with the stream empty and the content sent.
For performance reasons, data is first written into a buffer. When the buffer gets full, the data is written to the output (file, console, etc.). When the buffer is only partially filled and you want to send it to the output, you need to call the flush() method manually in order to write the partially filled buffer out.
I want to append a line of text to a file; however, I also want to record the position of that string in the file, so that I can later access it directly using a RandomAccessFile and file.seek() (or similar).
The issue is that a lot of file I/O operations are asynchronous, and the write operations can happen within very short time intervals, which suggests an asynchronous write, since everything else would be inefficient. How do I make sure the file pointer is calculated correctly? I am a newcomer to Java and don't yet understand the details of the different methods of file I/O, so excuse my question if a BufferedWriter is exactly what I am looking for; but then, how do you get the current length of it?
EDIT: Reading the entire file is NOT an option. The file is large, and as I said, the write operations happen often, several hundred every second at peak times.
Refer to the FileChannel class: http://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.html
Relevant snippets from the link:
The file itself contains a variable-length sequence of bytes that can be read and written and whose current size can be queried. The size of the file increases when bytes are written beyond its current size;
...
File channels are safe for use by multiple concurrent threads. The close method may be invoked at any time, as specified by the Channel interface. Only one operation that involves the channel's position or can change its file's size may be in progress at any given time; attempts to initiate a second such operation while the first is still in progress will block until the first operation completes. Other operations, in particular those that take an explicit position, may proceed concurrently; whether they in fact do so is dependent upon the underlying implementation and is therefore unspecified.
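Building on that, here is a sketch of one possible approach: track the write offset yourself and use positional reads and writes, which never touch the channel's shared file pointer. The IndexedAppender class and its method names are made up for illustration:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class IndexedAppender {
    private final FileChannel channel;
    private long position; // next write offset, tracked by this class

    public IndexedAppender(String file) throws IOException {
        channel = FileChannel.open(Paths.get(file),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE);
        position = channel.size(); // start appending at the current end of the file
    }

    // Appends a line and returns the offset at which it was written.
    public synchronized long append(String line) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap((line + "\n").getBytes(StandardCharsets.UTF_8));
        long offset = position;
        while (buf.hasRemaining()) {
            position += channel.write(buf, position); // positional write, no shared pointer
        }
        return offset;
    }

    // Reads `length` bytes back from a previously returned offset.
    public String readAt(long offset, int length) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(length);
        long pos = offset;
        while (buf.hasRemaining()) {
            int n = channel.read(buf, pos);
            if (n < 0) break; // reached end of file
            pos += n;
        }
        return new String(buf.array(), 0, buf.position(), StandardCharsets.UTF_8);
    }
}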
I am trying to understand piped streams.
Instead of piped streams, why can't we use other streams to pipe to each other, like below:
final ByteArrayOutputStream pos = new ByteArrayOutputStream();
final ByteArrayInputStream pis = new ByteArrayInputStream(pos.toByteArray());
And when will we get a deadlock with a piped stream? I tried reading and writing using a single main thread, but it executes smoothly.
The difficulty here is that the process must be implemented with several threads, because a write to one end of the pipe must be matched with a read at the other end.
It is certainly not difficult to create a thread that monitors arrivals at the end of one pipe and pushes them back through another pipe, but it cannot be done with a single thread.
Have you looked at this question?
Piped streams allow for efficient byte-by-byte processing with minimal effort.
I could very well be wrong, but I believe toByteArray() might not do what you think it does. It just copies the current contents, not any future contents.
So the only real issue here is the management of this, which would be a bit more difficult: you'd have to constantly poll the output stream. Not to mention the memory allocation of a new array for each call to toByteArray() (which "creates a newly allocated byte array" on every call).
How I suspect deadlocks may happen in a single thread:
If you try a blocking read from an input stream that doesn't have data yet, it will never be able to get any, because data can only come from the output stream, which must be written from the same thread; and that can't happen while you're sitting there waiting for data.
So, in a single thread, a deadlock will happen if you're not very careful, but it should be possible to use the streams successfully in the same thread without deadlocks. Then again, why would you want to? Another data structure may be better suited, such as a linked list or a simple circular array.
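For reference, here is a minimal sketch of the intended two-thread usage of piped streams (the messages are just illustrations):

import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.nio.charset.StandardCharsets;

public class PipeDemo {
    public static void main(String[] args) throws Exception {
        PipedOutputStream out = new PipedOutputStream();
        PipedInputStream in = new PipedInputStream(out); // connect the two ends

        // The writer runs in its own thread; writing and reading in the same
        // thread risks blocking once the pipe's internal buffer fills up.
        Thread writer = new Thread(() -> {
            try {
                for (int i = 0; i < 5; i++) {
                    out.write(("message " + i + "\n").getBytes(StandardCharsets.UTF_8));
                }
                out.close(); // closing signals end-of-stream to the reader
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        writer.start();

        // Main thread reads until the writer closes its end.
        int b;
        while ((b = in.read()) != -1) {
            System.out.print((char) b);
        }
        in.close();
        writer.join();
    }
}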
I read that invoking the flush() method guarantees that the last of the data you thought you had already written actually gets out to the file. I don't understand what this statement means; can anyone explain clearly what invoking flush() actually does?
Writers are usually buffered, so they wait for the buffer to fill before writing it to the file. flush() tells the writer to write out the buffer even though it might not be full yet. This is typically useful when you finish writing, since the last buffer may not be full but you still want its contents written out.
Many streams have internal buffers which they use to store data before it is passed on. This prevents a file stream from having to continually write each individual byte to disk (which can be quite expensive). The flush command forces a stream to clear its internal buffers so that, in this case, everything is forced to disk.