I want to append a line of text to a file. However, I also want to get the position of the string in the file, so that I can access the string directly using a RandomAccessFile and file.seek() (or similar).
The issue is that a lot of file I/O operations are asynchronous, and the write operations can happen within very short time intervals, which suggests an asynchronous write, since everything else is inefficient. How do I make sure the file pointer is calculated correctly? I am a newcomer to Java and don't yet understand the details of the different methods of file I/O, so excuse my question if using a BufferedWriter is exactly what I am looking for, but how do you get the current length of that?
EDIT: Reading the entire file is NOT an option. The file is large and as I said, the write operations happen often, several hundred every second in peak times.
Refer to the FileChannel class: http://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.html
Relevant snippets from the link:
The file itself contains a variable-length sequence of bytes that can be read and written and whose current size can be queried. The size of the file increases when bytes are written beyond its current size;
...
File channels are safe for use by multiple concurrent threads. The close method may be invoked at any time, as specified by the Channel interface. Only one operation that involves the channel's position or can change its file's size may be in progress at any given time; attempts to initiate a second such operation while the first is still in progress will block until the first operation completes. Other operations, in particular those that take an explicit position, may proceed concurrently; whether they in fact do so is dependent upon the underlying implementation and is therefore unspecified.
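A minimal sketch of how the FileChannel approach could give you the offset of each appended line, assuming this process is the only writer to the file. The class OffsetAppender and its methods are made up for illustration; only FileChannel, RandomAccessFile and the other JDK classes are real API. The end offset is tracked in a field rather than queried on every write, and appendLine is synchronized so that two threads can never reserve the same offset.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: appends lines and reports the byte offset where each one starts.
final class OffsetAppender implements AutoCloseable {
    private final FileChannel channel;
    private long end; // offset at which the next line will be written

    OffsetAppender(Path file) throws IOException {
        channel = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        end = channel.size(); // size is queried once; the file content is never read
    }

    // synchronized: reserving the offset and advancing it happen as one step
    synchronized long appendLine(String line) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap((line + "\n").getBytes(StandardCharsets.UTF_8));
        long start = end;
        long pos = start;
        while (buf.hasRemaining()) {
            pos += channel.write(buf, pos); // positional write; may be partial, so loop
        }
        end = pos;
        return start;
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }

    // Reads the line starting at the given offset using RandomAccessFile.seek().
    // Unbuffered byte-at-a-time reads keep the sketch short; they are slow for long lines.
    static String readLineAt(Path file, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offset);
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            for (int b = raf.read(); b != -1 && b != '\n'; b = raf.read()) {
                bytes.write(b);
            }
            return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```

The offset returned by appendLine can be stored in an index and later passed to readLineAt. If other processes also append to the file, the cached end offset is no longer reliable and this sketch does not apply.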
Calling new File(path).exists() generally requires a call to the file system, so in my understanding this is not an appropriate call to make from a single-threaded event loop. All the nio-based file classes such as AsynchronousFileChannel appear to do non-blocking reads or writes, but checking for the existence of the file appears to be blocking. Is there a way to check that a file exists, and/or get metadata such as file size, in a non-blocking fashion?
As per the comment from @davmac, the answer is "you can't".
so in my understanding this is not an appropriate call to make from a single-threaded event loop such as in Netty.
Your understanding is not correct, and this is a non sequitur.
All the nio-based file classes such as FileChannel appear to do non-blocking reads or writes
That's not correct either. FileChannel does not perform non-blocking I/O.
but opening the file appears to be blocking.
You will have to explain how you arrive at that conclusion and how it is adversely affecting you.
Is there a way to check a file exists, and/or get metadata such as file size in a non-blocking fashion?
There are several ways to check for file existence in Java, such as File.exists(), the corresponding Files.exists() method, and simply trying to open the file and catching the FileNotFoundException that results if it doesn't exist.
None of them blocks, at least not for any appreciable amount of time.
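A small sketch of both options, in case it helps. The class FileProbe and its method names are invented for this example. The first method is the plain synchronous check; the second hands the same call to an executor that you supply, which is one way to keep it off an event-loop thread if you are still worried about latency (the file system call itself is still synchronous on the worker thread).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

// Hypothetical helper for existence and size checks.
final class FileProbe {

    // Returns the file size in bytes, or -1 if the file does not exist.
    // A single call avoids the race between a separate exists() and size().
    static long sizeOrMinusOne(Path path) throws IOException {
        try {
            return Files.size(path);
        } catch (NoSuchFileException e) {
            return -1;
        }
    }

    // Runs the same check on the given executor so the calling thread is not held up.
    static CompletableFuture<Long> sizeAsync(Path path, Executor executor) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                return sizeOrMinusOne(path);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }, executor);
    }
}
```

On an event loop you would attach a callback with thenAccept(...) instead of calling get() on the returned future.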
I'm trying to use a MappedByteBuffer to allow concurrent reads on a file by multiple threads with the following constraints:
File is too large to load into memory
Threads must be able to read asynchronously (it's a web app)
The file is never written to by any thread
Every thread will always know the exact offset and length of bytes it needs to read (ie - no "seeking" by the app itself).
According to the docs (https://docs.oracle.com/javase/8/docs/api/java/nio/Buffer.html) Buffers are not thread-safe since they keep internal state (position, etc). Is there a way to have concurrent random access to the file without loading it all into memory?
Although FileChannel is technically thread-safe, from the docs:
Where the file channel is obtained from an existing stream or random access file then the state of the file channel is intimately connected to that of the object whose getChannel method returned the channel. Changing the channel's position, whether explicitly or by reading or writing bytes, will change the file position of the originating object, and vice versa
So it would seem that it's simply synchronized. If I were to new RandomAccessFile().getChannel().map() in each thread [edit: on every read] then doesn't that incur the I/O overhead with each read that MappedByteBuffers are supposed to avoid?
Rather than using multiple threads for concurrent reads, I'd go with this approach (based on an example with a huge CSV file whose lines have to be sent concurrently via HTTP):
Reading a single file at multiple positions concurrently wouldn't let you go any faster (but it could slow you down considerably).
Instead of reading the file from multiple threads, read the file from a single thread, and parallelize the processing of these lines. A single thread should read your CSV line-by-line, and put each line in a queue. Multiple working threads should then take the next line from the queue, parse it, convert to a request, and process the request concurrently as needed. The splitting of the work would then be done by a single thread, ensuring that there are no missing lines or overlaps.
If you can read the file line by line, LineIterator from Commons IO is a memory-efficient possibility. If you have to work with chunks, your MappedByteBuffer seems to be a reasonable approach. For the queue, I'd use a blocking queue with a fixed capacity—such as ArrayBlockingQueue—to better control the memory usage (lines/chunks in queue + lines/chunks among workers = lines/chunks in memory).
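The single-reader approach could be sketched roughly like this. LinePipeline, its parameters and the end-of-input marker are names I made up; it uses a plain BufferedReader instead of Commons IO to stay dependency-free. Note one simplification: if the handler throws, that worker dies, and if every worker dies the reader would block forever on a full queue, so real code needs error handling.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

// Hypothetical pipeline: one reading thread, several processing threads.
final class LinePipeline {
    // Unique marker object, compared by identity, that tells a worker to stop.
    private static final String END = new String("<end-of-input>");

    static void run(Path file, int workers, int capacity, Consumer<String> handler)
            throws IOException, InterruptedException {
        // Fixed capacity bounds the number of lines held in memory.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            Thread t = new Thread(() -> {
                try {
                    for (String line = queue.take(); line != END; line = queue.take()) {
                        handler.accept(line); // parse, convert to a request, send, ...
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            threads.add(t);
        }
        // The calling thread is the only reader, so no line is missed or read twice.
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            for (String line = reader.readLine(); line != null; line = reader.readLine()) {
                queue.put(line); // blocks while the queue is full
            }
        } finally {
            for (int i = 0; i < workers; i++) {
                queue.put(END); // one marker per worker
            }
            for (Thread t : threads) {
                t.join();
            }
        }
    }
}
```

A call such as LinePipeline.run(path, 4, 1000, line -> send(line)) would keep at most about 1000 queued lines plus 4 in-flight lines in memory (send is a placeholder for your own processing).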
FileChannel supports a read operation without synchronization. It natively uses pread on Linux:
public abstract int read(ByteBuffer dst, long position) throws IOException
From the FileChannel documentation:
...Other operations, in particular those that take an explicit position, may proceed concurrently; whether they in fact do so is dependent upon the underlying implementation and is therefore unspecified.
It is pretty primitive in that it only returns how many bytes were read (see details here), which can be fewer than requested, so the caller may have to loop. But I think you can still make use of it, given your assumption that "Every thread will always know the exact offset and length of bytes it needs to read".
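A sketch of how a thread could use the positional read when it knows the offset and length, assuming all threads share one FileChannel opened for reading. PositionalReads and readFully are illustrative names, not JDK API. The loop is there because a single read call is allowed to return fewer bytes than the buffer can hold.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Hypothetical helper around FileChannel.read(ByteBuffer, long).
final class PositionalReads {

    // Reads exactly `length` bytes starting at `offset`.
    // It never touches the channel's own position, so threads do not interfere.
    static byte[] readFully(FileChannel channel, long offset, int length) throws IOException {
        ByteBuffer dst = ByteBuffer.allocate(length);
        while (dst.hasRemaining()) {
            int n = channel.read(dst, offset + dst.position());
            if (n < 0) {
                throw new EOFException("file ends before offset " + (offset + length));
            }
        }
        return dst.array();
    }
}
```

Each thread allocates its own ByteBuffer here, so the non-thread-safe buffer state is never shared; only the channel is.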
I am receiving an infinite stream of data as input and I am writing it to a file "out.dat". At the same time, I have to process the data with a slow function process_data(). What I would like to do is to have a thread that reads data from the file "out.dat" and processes it with process_data(). So, in the main thread, I receive the input data and write it to "out.dat", in the second I read the data from "out.dat" and process it.
(The reason for not using an exchange buffer between the two threads is that it should be quite big as the input stream is fast and process_data() is slow. Over time, the exchange buffer will grow too much.)
It seems that using FileChannel would be appropriate. In the documentation I read "The view of a file provided by an instance of this class is guaranteed to be consistent with other views of the same file provided by other instances in the same program."
However, I am not sure since it does not specify what "consistent" really means. My need is to write data to "out.dat" and, later, from another thread, read (FileChannel#read) the same data again. Does "consistent" mean that once I write the data, say at position 100000, when I do read(byteBuffer, 100000) I get the same data?
Do I have to call FileChannel#force to force the writings to be saved on disk?
Thank you in advance.
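To make the scenario in this question concrete, here is a rough sketch of the write-then-read-back pattern, assuming both threads live in the same JVM (the case the quoted documentation sentence covers). SpillFile and its methods are hypothetical names. The writer publishes the number of bytes written through an AtomicLong only after the write has completed, and the reader never reads past that mark. The sketch does not call force(), which the documentation describes in terms of getting updates onto the storage device.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical spill file: one producer thread appends, one consumer thread reads back.
final class SpillFile implements AutoCloseable {
    private final FileChannel writer;
    private final FileChannel reader;
    private final AtomicLong committed = new AtomicLong(); // bytes completely written

    SpillFile(Path file) throws IOException {
        writer = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING);
        reader = FileChannel.open(file, StandardOpenOption.READ);
    }

    // Producer thread only.
    void append(byte[] data) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(data);
        long pos = committed.get();
        while (buf.hasRemaining()) {
            pos += writer.write(buf, pos);
        }
        committed.set(pos); // publish only after the bytes are in the file
    }

    // Consumer thread: fills dst from `offset`, but never beyond the committed mark.
    int read(ByteBuffer dst, long offset) throws IOException {
        long available = committed.get() - offset;
        if (available <= 0) {
            return 0; // nothing new yet
        }
        int originalLimit = dst.limit();
        if (dst.remaining() > available) {
            dst.limit(dst.position() + (int) available);
        }
        try {
            int total = 0;
            while (dst.hasRemaining()) {
                int n = reader.read(dst, offset + total);
                if (n < 0) break;
                total += n;
            }
            return total;
        } finally {
            dst.limit(originalLimit);
        }
    }

    @Override
    public void close() throws IOException {
        writer.close();
        reader.close();
    }
}
```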
I have two (Java) processes on different JVMs running repeatedly. The first one regularly finds some "information" and needs to store it somewhere. The second process regularly reads this information to handle it. The intervals are more or less random, so process 1 may find three pieces of information until process 2 reads them or vice versa.
My approach is to write this information to text files. But I am afraid that appending and reading the text files accidentally happens at the same time so that I run into locks. But writing a new text file for each piece of information seems like overkill.
What would be a better solution?
EDIT: I am sorry, I did not make clear: The java processes run in different JVMs. They cannot see each other directly.
You can get this to work, provided you are careful with file handling and you don't have a high update rate e.g. 10 updates per second.
Note: you could do it with file renaming instead of locks.
What would be a better solution?
Just about anything. SO is not for recommending things, but in this case I could recommend just about anything without more specific requirements. I could, for example, recommend my library Chronicle Queue, because I wrote it and I am sure it could do what you want; however, there are many possible alternatives.
I am sending about one line of text every minute.
So you can write a temporary file for each message, rename it when finished. The consumer can have a directory watcher so it knows as soon as you have done this. The consumer could delete the file when done. This has an overhead but it would be less than 10 ms.
If you want to keep a record of all messages, the producer can also write to a log file.
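A rough sketch of the temporary-file-plus-rename idea. FileMailbox, the msg- prefix and the .tmp/.msg suffixes are all made up for this example. For brevity the consumer side is a drain method that lists finished files; a directory watcher (java.nio.file.WatchService) could be used to decide when to call it. The temporary file is created inside the target directory so that the rename stays on one file system.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical file-based mailbox shared by two processes through a directory.
final class FileMailbox {

    // Producer: write under a temporary name, then rename so that the consumer
    // only ever sees complete files.
    static Path publish(Path dir, String message) throws IOException {
        Path tmp = Files.createTempFile(dir, "msg-", ".tmp");
        Files.write(tmp, message.getBytes(StandardCharsets.UTF_8));
        String finalName = tmp.getFileName().toString().replace(".tmp", ".msg");
        return Files.move(tmp, dir.resolve(finalName), StandardCopyOption.ATOMIC_MOVE);
    }

    // Consumer: read and delete every finished message (in no particular order).
    static List<String> drain(Path dir) throws IOException {
        List<String> messages = new ArrayList<>();
        try (DirectoryStream<Path> ready = Files.newDirectoryStream(dir, "*.msg")) {
            for (Path p : ready) {
                messages.add(new String(Files.readAllBytes(p), StandardCharsets.UTF_8));
                Files.delete(p);
            }
        }
        return messages;
    }
}
```

Because the consumer matches only *.msg, it ignores files that are still being written.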
Is there any method so that I can split a text file in java without reading it?
I want to process a large text file, several GB in size, so I want to split the file into small parts, run a thread over each part, and combine the results.
Since I will be reading the small parts anyway, splitting the file by reading it first doesn't make sense: I would have to read the same data twice, which would degrade performance.
Your threading approach is ill-formed. If you have to do significant processing with your file data, consider the following threading structure:
1 Reader Thread (reads the file and feeds the workers)
Queue with read chunks
1..n Worker Threads (n depends on your CPU cores; they process the data chunks from the reader thread)
Queue or dictionary with processed chunks
1 Writer Thread (writes results to some file)
Maybe you could combine the reader and writer threads into one thread, because it doesn't make much sense to parallelize I/O on the same physical hard disk.
Clearly you need some synchronization between the threads. For the queues in particular, think about semaphores.
Without reading the content of file you can't do that. That is not possible.
I don't think this is possible for the following reasons:
How do you write a file without "reading" it?
You'll need to read in the text to know where a character boundary is (the encoding is not necessarily 1 byte). This means that you cannot treat the file as binary.
Is it really not possible to read line by line and process it like that? That also saves the additional space that the split files would take up alongside the original. For your reference, reading a text file is simply:
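import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public static void loadFileFromInputStream(InputStream in) throws IOException {
    // Wrap the stream so it can be read one line at a time
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    String record = reader.readLine();
    while (record != null) { // readLine() returns null at end of input
        // do something with the record
        // ...
        record = reader.readLine();
    }
}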
You're only reading one line at a time... so the size of the file does not affect how much memory you need. You can also stop any time you have to. If you're adventurous, you can also hand the lines to separate threads to speed up processing. That way, I/O can continue churning along while you process your data.
Good luck! If, for some reason, you do find a solution, please post it here. Thanks!
Technically speaking, it can't be done without reading the file. But you also don't need to keep the entire file contents in memory to do the splitting. Just open a stream to the file and write out to other files, redirecting output to the next file after a certain number of bytes has been written to the current one. This way you are not required to keep more than one byte of file data in memory at any given time. But a larger buffer, about 8 or 16 KB, will dramatically increase performance.
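Something along these lines, as a sketch. FileSplitter and the .partN naming are invented for the example. It streams through an 8 KB buffer and cuts purely on byte counts, so a cut can land in the middle of a line or of a multi-byte character.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical splitter: copies `source` into consecutive files of at most partSize bytes.
final class FileSplitter {

    static List<Path> split(Path source, long partSize) throws IOException {
        if (partSize <= 0) {
            throw new IllegalArgumentException("partSize must be positive");
        }
        List<Path> parts = new ArrayList<>();
        byte[] buffer = new byte[8 * 1024]; // only this much file data is in memory at once
        OutputStream out = null;
        long written = 0; // bytes written to the current part
        try (InputStream in = Files.newInputStream(source)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                int off = 0;
                while (off < n) {
                    if (out == null || written == partSize) {
                        if (out != null) out.close();
                        Path part = source.resolveSibling(
                                source.getFileName() + ".part" + parts.size());
                        parts.add(part);
                        out = Files.newOutputStream(part); // redirect output to the next file
                        written = 0;
                    }
                    int chunk = (int) Math.min(n - off, partSize - written);
                    out.write(buffer, off, chunk);
                    off += chunk;
                    written += chunk;
                }
            }
        } finally {
            if (out != null) out.close();
        }
        return parts;
    }
}
```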
Something has to read your file to split it (and you probably want to split it at line boundaries, not at some multiple of kilobytes).
If you are running on a Linux machine, you could delegate the splitting to an external command like csplit. So your Java program would simply run a csplit yourbigfile.txt command.
In the literal sense, no. To literally split a file into smaller files, you have to read the large one and write the smaller ones.
However, I think you really want to know if you can have different threads sequentially reading different "parts" of a file at the same time. And the answer is that you can do that. Just have each thread create its own RandomAccessFile object for the file, seek to the relevant place, and start reading.
(A FileInputStream would probably work too, though I don't think that the Java API spec guarantees that skip is implemented using an OS-level "seek" operation on the file.)
There are a couple of possible complications:
If the file is text, you presumably want each thread to start processing at the start of some line in the file. So each thread has to start by finding the end of a line, and make sure that it reads to the end of the last line in its "part".
If the file uses a variable width character encoding (e.g. UTF-8), then you need to deal with the case where your partition boundaries fall in the middle of a character.
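Here is a sketch of how the partition boundaries could be moved onto line starts, assuming the file is UTF-8 (or ASCII) with '\n' line endings. In UTF-8 the byte 0x0A never occurs inside a multi-byte character, so scanning forward for it also keeps the boundary off the middle of a character. LinePartitions and its methods are hypothetical names, and readPart assumes a part fits in memory, which a real worker processing gigabytes would replace with streaming.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

// Hypothetical helper that cuts a text file into line-aligned byte ranges.
final class LinePartitions {

    // Returns parts + 1 offsets; part i covers [result[i], result[i + 1]).
    static long[] boundaries(Path file, int parts) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long length = raf.length();
            long[] b = new long[parts + 1];
            b[parts] = length;
            for (int i = 1; i < parts; i++) {
                long nominal = Math.max(b[i - 1], length * i / parts);
                b[i] = nextLineStart(raf, nominal, length);
            }
            return b;
        }
    }

    // First offset >= from that is the start of a line (or the end of the file).
    private static long nextLineStart(RandomAccessFile raf, long from, long length)
            throws IOException {
        if (from <= 0) return 0;
        if (from >= length) return length;
        raf.seek(from - 1); // if the previous byte is '\n', `from` already starts a line
        for (int c = raf.read(); c != -1; c = raf.read()) {
            if (c == '\n') return raf.getFilePointer();
        }
        return length;
    }

    // What each thread would do: open its own RandomAccessFile, seek, and read its range.
    static String readPart(Path file, long start, long end) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            byte[] bytes = new byte[(int) (end - start)]; // assumes the part fits in memory
            raf.seek(start);
            raf.readFully(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }
}
```

A part can come out empty when a single line spans more than one nominal partition; callers should simply skip such ranges.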