I need to write an algorithm that downloads parts of a file from different locations and merges them into a single file on my local drive. The file may be huge (several gigabytes), but each part is small.
Each part has a header that identifies the file it belongs to and the byte offset at which it should be placed in that file.
Every part is downloaded in its own thread.
Just after the header has been decoded, I open a MappedByteBuffer:
MappedByteBuffer memoryMappedFile;
try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
    // Map only this part's region of the target file.
    memoryMappedFile = raf.getChannel().map(FileChannel.MapMode.READ_WRITE, offset, mappedSize);
}
The problem is that several threads can execute this code at the same time (while mapping different parts of the same file), and the map call above then throws an IOException with the message "This operation cannot be performed on a file having an open user mapped section" (translated from a localized message).
If I synchronize the whole block, the exception is not thrown, so I guess it's OK to use several MappedByteBuffers on the same file as long as they're not being opened at the same time. Is it possible to achieve this without synchronizing that part, and if not, is there a better solution?
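For reference, the synchronized variant I'm using now looks roughly like this (the mapLock field and the mapPart method are just names for this sketch, not part of my real class):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;

// One lock object shared by all downloader threads writing to the same target file.
private final Object mapLock = new Object();

private MappedByteBuffer mapPart(Path file, long offset, long mappedSize) throws IOException {
    synchronized (mapLock) {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            // The returned buffer stays valid after the channel is closed.
            return raf.getChannel().map(FileChannel.MapMode.READ_WRITE, offset, mappedSize);
        }
    }
}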
The program runs on Windows 8.1.
Related
I want to test to see if I can lock file X. If the file doesn't exist or can't be locked, fail. It sounds simple, but I keep running into dead-ends. A short-cut to the answer would be to provide a way to get a FileChannel I can make an exclusive lock on, without any risk of creating a file. Let me explain...
I can't use NIO lock() without a writable FileChannel and I can't get a writable FileChannel without opening file X in such a way that if it doesn't exist, it is created. Every Java method I've found for trying to open file X as writable creates it if it doesn't exist and there doesn't seem to be a way to exclusively lock a file on a read-only FileChannel.
Even checking to confirm the file exists first is problematic. Here's my latest code:
private LockStatus openAndLockFile(File file) {
    if (Files.notExists(file.toPath())) {
        synchronized (fileList) {
            removeFile(file);
        }
        return LockStatus.FILE_NOT_FOUND;
    }
    try {
        rwFile = new RandomAccessFile(file, "rw");
        fileLock = rwFile.getChannel().lock();
    }
    catch ...
The problem with this code is that the file might exist when the notExists check runs, but be gone by the time new RandomAccessFile(file, "rw") runs. This is because I'm running this code in multiple threads and multiple processes, which means the code has to be air-tight and rock solid.
Here's an example of the problem with my existing code running in two processes:
1: Process A detects the new file
2: Process B detects the same file
3: Process A processes the file and moves it to another folder
Problem ---> Process B accidentally creates an empty file. Oops!
4: Process B detects the new file it just created and processes it, creating a duplicate file which is 0 bytes.
5: Process A also detects the new file accidentally created by process B and tries to process it...
Here's an example using C# of what I'm trying to do:
Stream iStream = File.Open("c:\\software\\code.txt", FileMode.Open,
                           FileAccess.Read, FileShare.None);
Any help or hints are greatly appreciated! Thanks!
If you are trying to prevent two threads in the same application (the same JVM) from processing the same file, then you should implement this using regular Java locks, not file locks. File locks are acquired on behalf of the whole JVM, so they won't stop another thread in the same JVM from working on the same file.
What I would do is create a thread-safe locking class that wraps a HashSet<File>, where the File objects denote absolute file paths for files that exist. Then implement "file locking" in terms of the File objects in that set.
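A minimal sketch of what I mean (the class and method names are mine, not from any library):

import java.io.File;
import java.util.HashSet;
import java.util.Set;

// In-JVM "file locking": membership in the set means some thread owns that file.
public class InJvmFileLocker {
    private final Set<File> locked = new HashSet<>();

    // Try to take the lock; returns false if another thread already holds it.
    public synchronized boolean tryLock(File file) {
        return locked.add(file.getAbsoluteFile());
    }

    // Block until the lock becomes available.
    public synchronized void lock(File file) throws InterruptedException {
        File key = file.getAbsoluteFile();
        while (!locked.add(key)) {
            wait();
        }
    }

    public synchronized void unlock(File file) {
        locked.remove(file.getAbsoluteFile());
        notifyAll();
    }
}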
Unfortunately, not only will they be in different JVMs, they will sometimes be on different servers.
In that case, the best strategy is probably to use a database to implement the locks.
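A hedged sketch of one way to do that with plain JDBC, assuming a lock table such as file_locks(path VARCHAR PRIMARY KEY) that you create yourself:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Try to claim a file across JVMs/servers by inserting its path into a lock table.
// The primary key on 'path' makes the second insert fail, so only one process "wins".
static boolean tryClaim(Connection con, String path) {
    try (PreparedStatement ps =
             con.prepareStatement("INSERT INTO file_locks (path) VALUES (?)")) {
        ps.setString(1, path);
        ps.executeUpdate();
        return true;                  // we now hold the "lock"
    } catch (SQLException e) {
        return false;                 // most likely a duplicate key: someone else holds it
    }
}

The row would then be deleted again (or the path marked as done) once processing finishes.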
Is it possible to read and write the same text file from both a C++ application and a Java application at the same time without writing conflicting lines/characters to it? I have tested with two Java applications for now, and it seems it's possible to write to the file from one process even while the other process has opened a stream to it but not closed it. Is there any way to lock the file so that the other process has to wait?
I think so; for example, Boost.Interprocess (http://www.boost.org/doc/libs/1_50_0/doc/html/interprocess.html) provides file locks (http://www.boost.org/doc/libs/1_50_0/doc/html/interprocess/synchronization_mechanisms.html#interprocess.synchronization_mechanisms.file_lock).
For two processes that are writing to the same file, as long as you flush your output buffers on line boundaries (i.e., flush after you write a newline character sequence), the data written to the file should be interleaved nicely.
If one process is writing while another is reading from the same file, you just have to ensure that the reads don't get ahead of the writes. If a read gets an end-of-file condition (or worse, a partial data line), then you know that the reading process must wait until the writing process has finished writing another chunk of data to the file.
If you need more complicated read/write control, you should consider some kind of locking mechanism.
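On the Java side, a minimal sketch of waiting for an exclusive OS-level lock might look like this (whether the C++ process is actually held off depends on it using a compatible lock, e.g. LockFileEx on Windows or fcntl/flock on Unix; the file name here is a placeholder):

import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

public class LockedWrite {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("shared.txt", "rw");
             FileChannel channel = raf.getChannel();
             FileLock lock = channel.lock()) {   // blocks until the exclusive lock is granted
            channel.position(channel.size());    // append to the end of the file
            channel.write(ByteBuffer.wrap("one line\n".getBytes("UTF-8")));
        }                                        // lock and channel are released here
    }
}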
Is there any way to split a text file in Java without reading it?
I want to process a large text file that is several GB in size, so I want to split it into small parts, run a thread over each part, and combine the results.
Since I will already be reading the small parts, splitting the file by reading it makes no sense: I would have to read the same file twice, which would degrade performance.
Your threading approach is ill-formed. If you have to do significant processing on your file data, consider the following threading structure:
1 reader thread (reads the file and feeds the workers)
Queue with read chunks
1..n worker threads (n depends on your CPU cores; they process the data chunks from the reader thread)
Queue or dictionary with processed chunks
1 writer thread (writes the results to some file)
Maybe you could combine the reader and writer threads into one, because it doesn't make much sense to parallelize I/O on the same physical hard disk.
It's clear that you need some synchronization between the threads; for the queues in particular, think about semaphores (a BlockingQueue also works, as sketched below).
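A rough sketch of that structure using a BlockingQueue (the file name, chunking by line, and thread counts are all placeholder choices):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelineSketch {
    private static final String POISON = new String("EOF");   // sentinel, compared by reference

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> chunks = new ArrayBlockingQueue<>(1000);
        int workers = Runtime.getRuntime().availableProcessors();

        // 1 reader thread: reads the file and feeds the workers.
        Thread reader = new Thread(() -> {
            try (BufferedReader in = new BufferedReader(new FileReader("big.txt"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    chunks.put(line);
                }
                for (int i = 0; i < workers; i++) {
                    chunks.put(POISON);                        // one stop marker per worker
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        reader.start();

        // 1..n worker threads: process chunks taken from the queue.
        for (int i = 0; i < workers; i++) {
            new Thread(() -> {
                try {
                    String chunk;
                    while ((chunk = chunks.take()) != POISON) {
                        // process the chunk here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
        // A writer thread collecting results from a second queue is omitted here.
    }
}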
Without reading the content of the file you can't do that. It is not possible.
I don't think this is possible for the following reasons:
How do you write a file without "reading" it?
You'll need to read in the text to know where a character boundary is (the encoding is not necessarily 1 byte). This means that you cannot treat the file as binary.
Is it really not possible to read line by line and process it that way? That also saves the additional space the split files would take up alongside the original. For your reference, reading a text file is simply:
public static void loadFileFromInputStream(InputStream in) throws IOException {
    BufferedReader inputStream = new BufferedReader(new InputStreamReader(in));
    String record = inputStream.readLine();
    while (record != null) {
        // do something with the record
        // ...
        record = inputStream.readLine();
    }
}
You're only reading one line at a time... so the size of the file does not impact performance at all. You can also stop any time you have to. If you're adventurous, you can also hand the lines to separate threads to speed up processing. That way, I/O can continue churning along while you process your data.
Good luck! If, for some reason, you do find a solution, please post it here. Thanks!
Technically speaking, it can't be done without reading the file. But you also don't need to keep the entire file contents in memory to do the splitting. Just open a stream to the file and write out to other files, redirecting the output to a new file after a certain number of bytes have been written to the current one. This way you never need to keep more than one byte of file data in memory at any given time, although using a larger buffer of about 8 or 16 KB will dramatically increase performance.
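A hedged sketch of that streaming split (the part-naming scheme and the 8 KB buffer are arbitrary choices for illustration):

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Split 'source' into parts of roughly 'partSize' bytes without holding it in memory.
static void split(String source, long partSize) throws IOException {
    byte[] buffer = new byte[8 * 1024];                  // 8 KB buffer for throughput
    try (InputStream in = new BufferedInputStream(new FileInputStream(source))) {
        int part = 0;
        int n = in.read(buffer);
        while (n > 0) {
            long written = 0;
            try (OutputStream out = new BufferedOutputStream(
                    new FileOutputStream(source + ".part" + part++))) {
                while (n > 0 && written < partSize) {
                    out.write(buffer, 0, n);             // write what was just read
                    written += n;
                    n = in.read(buffer);                 // refill the buffer
                }
            }
        }
    }
}

Note that splitting by byte count can cut a line (or even a multi-byte character) in half, which is exactly the complication the other answers point out.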
Something has to read your file to split it (and you probably want to split it at line boundaries, not at some multiple of kilobytes).
If you are running on a Linux machine, you could delegate the splitting to an external command like csplit. Your Java program would then simply run a csplit yourbigfile.txt command.
In the literal sense no. To literally split a file into smaller files, you have to read the large one and write the smaller ones.
However, I think you really want to know if you can have different threads sequentially reading different "parts" of a file at the same time. And the answer is that you can do that. Just have each thread create its own RandomAccessFile object for the file, seek to the relevant place, and start reading.
(A FileInputStream would probably work too, though I don't think the Java API spec guarantees that skip is implemented using an OS-level "seek" operation on the file.)
There are a couple of possible complications:
If the file is text, you presumably want each thread to start processing at the start of some line in the file. So each thread has to start by finding the end of a line, and make sure that it reads to the end of the last line in its "part".
If the file uses a variable width character encoding (e.g. UTF-8), then you need to deal with the case where your partition boundaries fall in the middle of a character.
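A minimal sketch of the per-thread reading, assuming the byte ranges have already been decided and leaving out the line-boundary handling described above:

import java.io.RandomAccessFile;

// Each thread reads its own [start, end) byte range of the same file.
static Runnable partReader(String path, long start, long end) {
    return () -> {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(start);                             // jump to this thread's part
            byte[] buf = new byte[64 * 1024];
            long remaining = end - start;
            while (remaining > 0) {
                int n = raf.read(buf, 0, (int) Math.min(buf.length, remaining));
                if (n < 0) break;                        // hit end of file early
                // process buf[0..n) here
                remaining -= n;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    };
}

Each thread would then be started with something like new Thread(partReader("big.txt", 0, 1_000_000)).start(), with the ranges covering the whole file.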
I'm doing some file I/O with multiple files (writing to 19 files, as it happens). After writing to them a few hundred times I get the Java IOException "Too many open files". But I actually have only a few files open at once. What is the problem here? I can verify that the writes were successful.
On Linux and other UNIX / UNIX-like platforms, the OS places a limit on the number of open file descriptors that a process may have at any given time. In the old days, this limit used to be hardwired1, and relatively small. These days it is much larger (hundreds / thousands), and subject to a "soft" per-process configurable resource limit. (Look up the ulimit shell builtin ...)
Your Java application must be exceeding the per-process file descriptor limit.
You say that you have 19 files open, and that after a few hundred times you get an IOException saying "too many files open". Now this particular exception can ONLY happen when a new file descriptor is requested; i.e. when you are opening a file (or a pipe or a socket). You can verify this by printing the stacktrace for the IOException.
Unless your application is being run with a small resource limit (which seems unlikely), it follows that it must be repeatedly opening files / sockets / pipes, and failing to close them. Find out why that is happening and you should be able to figure out what to do about it.
FYI, the following pattern is a safe way to write to files that is guaranteed not to leak file descriptors.
Writer w = new FileWriter(...);
try {
// write stuff to the file
} finally {
try {
w.close();
} catch (IOException ex) {
// Log error writing file and bail out.
}
}
1 - Hardwired, as in compiled into the kernel. Changing the number of available fd slots required a recompilation ... and could result in less memory being available for other things. In the days when Unix commonly ran on 16-bit machines, these things really mattered.
UPDATE
The Java 7 way is more concise:
try (Writer w = new FileWriter(...)) {
// write stuff to the file
} // the `w` resource is automatically closed
UPDATE 2
Apparently you can also encounter a "too many files open" error while attempting to run an external program. The basic cause is as described above; however, the reason you encounter this in exec(...) is that the JVM is attempting to create the "pipe" file descriptors that will be connected to the external application's standard input / output / error.
For UNIX:
As Stephen C has suggested, raising the maximum file descriptor limit avoids this problem.
Try looking at your present file descriptor capacity:
$ ulimit -n
Then change the limit according to your requirements.
$ ulimit -n <value>
Note that this just changes the limits in the current shell and any child / descendant process. To make the change "stick" you need to put it into the relevant shell script or initialization file.
You're obviously not closing your file descriptors before opening new ones. Are you on Windows or Linux?
Although in most cases the error quite clearly means that file handles have not been closed, I just encountered an instance with JDK 7 on Linux that is strange enough to be worth explaining here.
The program opened a FileOutputStream (fos), a BufferedOutputStream (bos), and a DataOutputStream (dos). After writing to the DataOutputStream, the dos was closed and I thought everything had gone fine.
Internally, however, the dos tried to flush the bos, which returned a "disk full" error. That exception was eaten by the DataOutputStream, and as a consequence the underlying bos was not closed, so the fos was still open.
At a later stage the file was renamed (from something ending in .tmp) to its real name. Thereby, the Java file descriptor tracking lost track of the original .tmp, yet it was still open!
To solve this, I had to flush the DataOutputStream myself first, catch the IOException, and close the FileOutputStream myself.
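In code, the workaround looked roughly like this (simplified; the names are placeholders):

import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

static void writeSafely(String path, byte[] data) throws IOException {
    FileOutputStream fos = new FileOutputStream(path);
    DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(fos));
    try {
        dos.write(data);
        dos.flush();            // flush explicitly so a disk-full error surfaces here
    } finally {
        try {
            dos.close();        // may quietly swallow a flush failure, as described above
        } finally {
            fos.close();        // make sure the underlying descriptor is closed regardless
        }
    }
}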
I hope this helps someone.
If you're seeing this in automated tests: it's best to properly close all files between test runs.
If you're not sure which file(s) you have left open, a good place to start is the "open" calls which are throwing exceptions! 😄
If you have a file handle that should stay open exactly as long as its parent object is alive, you could add a finalize method to the parent that calls close on the file handle, and call System.gc() between tests.
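A sketch of that idea (finalize is deprecated in newer JDKs, so treat this strictly as a test-only safety net rather than a real fix):

import java.io.Closeable;
import java.io.IOException;

class ParentWithHandle {
    private final Closeable handle;

    ParentWithHandle(Closeable handle) {
        this.handle = handle;
    }

    @Override
    protected void finalize() throws Throwable {
        try {
            handle.close();     // close the file handle when the parent is collected
        } catch (IOException ignored) {
            // nothing useful to do here
        } finally {
            super.finalize();
        }
    }
}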
Recently I had a program batch-processing files. Even though I was definitely closing each file in the loop, the error was still there.
I eventually resolved the problem by garbage collecting eagerly every hundred files or so:
int index = 0;
while (...) {
    try {
        // do with outputStream...
    } finally {
        out.close();
    }
    if (index++ % 100 == 0) {
        System.gc();
    }
}
I am using the gzip utilities on a Windows machine. I compressed a file and stored it in the DB as a blob. When I want to decompress this file using the gzip utility, I write the byte stream to process.getOutputStream. But after about 30 KB it is unable to read the file; it just hangs there.
I tried memory arguments and different read/flush logic, but if I write the same data to a file instead, it is pretty fast.
OutputStream stdin = proc.getOutputStream();
Blob blob = Hibernate.createBlob(inputFileReader);
InputStream source = blob.getBinaryStream();
byte[] buffer = new byte[256];
long readBufferCount = 0;
int bytesRead;
while ((bytesRead = source.read(buffer)) > 0)
{
    stdin.write(buffer, 0, bytesRead);   // write only the bytes actually read
    stdin.flush();
    readBufferCount = readBufferCount + bytesRead;
    log.info("Reading the file - Read bytes: " + readBufferCount);
}
stdin.flush();
I suspect that the problem is that the external process (connected to proc) is either
not reading its standard input, or
it is writing stuff to its standard output that your Java application is not reading.
Bear in mind that Java talks to the external process using a pair of "pipes", and these have a limited amount of buffering. If you exceed the buffering capacity of a pipe, the writer process will be blocked writing to the pipe until the reader process has read enough data from the pipe to make space. If the reader doesn't read, then the pipeline locks up.
If you provided more context (e.g. the part of the application that launches the gzip process) I'd be able to be more definitive.
FOLLOWUP
gzip.exe is a Unix utility that we are using on Windows. gzip.exe works fine from the command prompt, but not with the Java program. Is there any way we can increase the size of the buffer Java writes to a pipe? I am concerned about the input part at present.
On UNIX, the gzip utility is typically used one of two ways:
gzip file compresses file turning it into file.gz.
... | gzip | ... (or something similar) which writes a compressed version of its standard input to its standard output.
I suspect that you are doing the equivalent of the latter, with the Java application as both the source of the gzip command's input and the destination of its output. And this is precisely the scenario that can lock up ... if the Java application is not implemented correctly. For instance:
Process proc = Runtime.getRuntime().exec(...); // gzip.exe pathname
OutputStream out = proc.getOutputStream();
while (...) {
    out.write(...);
}
out.flush();
InputStream in = proc.getInputStream();
while (...) {
    in.read(...);
}
If the write phase of the application above writes too much data, it is guaranteed to lock up.
Communication between the java application and gzip is via two pipes. As I stated above, a pipe will buffer a certain amount of data, but that amount is relatively small, and certainly bounded. This is the cause of the lockup. Here is what happens:
The gzip process is created with a pair of pipes connecting it to the Java application process.
The Java application writes data to its out stream
The gzip process reads that data from its standard input, compresses it, and writes it to its standard output.
Steps 2 and 3 are repeated a few times, until finally the gzip process's attempt to write to its standard output blocks.
What has been happening is that gzip has been writing into its output pipe, but nothing has been reading from it. Eventually, we reach the point where we've exhausted the output pipe's buffer capacity, and the write to the pipe blocks.
Meanwhile, the Java application is still writing to the out Stream, and after a couple more rounds, this too blocks because we've filled the other pipe.
The only solution is for the Java application to read and write at the same time. The simple way to do this is to create a second thread and do the writing to the external process from one thread and the reading from the process in the other one.
(Changing the Java buffering or the Java read / write sizes won't help. The buffering that matters is in the OS implementations of the pipes, and there's no way to change that from pure Java, if at all.)
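A hedged sketch of that two-thread approach, assuming gzip is on the PATH and the compressed output is simply collected in memory:

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

public class GzipPipeSketch {
    public static void main(String[] args) throws Exception {
        Process proc = new ProcessBuilder("gzip", "-c").start();

        // Drain gzip's standard output on a separate thread so its pipe never fills up.
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        Thread drainer = new Thread(() -> {
            try (InputStream in = proc.getInputStream()) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) > 0) {
                    compressed.write(buf, 0, n);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        drainer.start();

        // Meanwhile the main thread writes the uncompressed data to gzip's standard input.
        try (OutputStream out = proc.getOutputStream()) {
            out.write(loadSourceBytes());                // placeholder for the blob / file data
        }                                                // closing stdin tells gzip the input is done

        drainer.join();
        proc.waitFor();
        // 'compressed' now holds the gzipped data.
    }

    private static byte[] loadSourceBytes() {
        return new byte[0];                              // stand-in for reading the real input
    }
}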