(This is a hypothetical question since it's very broad, and workarounds exist for specific cases.)
Is it possible to atomically write a byte[] to a file (as a FileOutputStream or FileWriter)?
If writing fails, then it's unacceptable that part of the array is written. For example, if the array is 1,000,000 bytes and the disk is full after 500,000 bytes, then no bytes should be written to the file, or the changes should somehow be rolled back. This should even be the case if a medium is physically disconnected mid-write.
Assume that the maximum size of the array is known.
Atomic writes to files are not possible. Operating systems don't support them, and since they don't, programming language libraries can't provide them either.
The best you are going to get with files in a conventional file system is atomic file renaming; i.e.
- write the new file into the same file system as the old one
- use FileDescriptor.sync() to ensure that the new file has been written to the device
- rename the new file over the old one, e.g. using java.nio.file.Files.move(Path source, Path target, CopyOption... options) with the CopyOption ATOMIC_MOVE. According to the javadocs, this may not be supported, but if it isn't you should get an exception. (A minimal sketch of the whole pattern follows this list.)
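Here is one way the write-then-rename pattern could look. This is a sketch, not a definitive implementation: the ".tmp" suffix and method name are arbitrary choices for illustration.

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class AtomicWriteSketch {
    // Write the data to a temporary file in the same directory, force it to the
    // storage device, then atomically rename it over the target.
    static void atomicWrite(Path target, byte[] data) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");   // arbitrary temp name
        try (FileOutputStream out = new FileOutputStream(tmp.toFile())) {
            out.write(data);
            out.getFD().sync();   // FileDescriptor.sync(): flush file contents to the device
        }
        // Throws an exception if the move cannot be performed as an atomic operation.
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }
}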
But note that the atomicity is implemented in the OS, and if the OS cannot give strong enough guarantees, you are out of luck.
(One issue is what might happen in the event of a hard disk error. If the disk dies completely, then atomicity is moot. But if the OS is still able to read data from the disk after the failure, then the outcome may depend on the OS's ability to repair a possibly inconsistent file system.)
Related
I know this question has been widely discussed in different posts:
java get file size efficiently
How to get file size in Java
Get size of folder or file
https://www.mkyong.com/java/how-to-get-file-size-in-java/
My problem is that I need to obtain the sizes of a large number of files (regular files on a hard drive), and for this I need the solution with the best performance. My intuition is that it should be done through a method that reads the file system metadata directly, rather than obtaining the size by reading the whole file contents. It is difficult to tell from the documentation which approach each method uses.
As stated on this page:
Files has the size() method to determine the size of the file. This is
the most recent API and it is recommended for new Java applications.
But this is apparently not the best advice in terms of performance. I have measured several methods (a timing sketch follows the list):
file.length();
Files.size(path);
BasicFileAttributes attr = Files.readAttributes(path, BasicFileAttributes.class); attr.size();
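For reference, a minimal way to time the three calls could look like the sketch below. The path and repetition count are placeholders; this is not the exact benchmark used for the numbers in the EDIT.

import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributes;

public class FileSizeTiming {
    public static void main(String[] args) throws Exception {
        Path path = Paths.get("some-file.bin");   // placeholder file
        File file = path.toFile();
        int reps = 100_000;
        long sink = 0;                            // consume results so calls aren't optimised away

        long t0 = System.nanoTime();
        for (int i = 0; i < reps; i++) sink += file.length();
        long t1 = System.nanoTime();
        for (int i = 0; i < reps; i++) sink += Files.size(path);
        long t2 = System.nanoTime();
        for (int i = 0; i < reps; i++) {
            BasicFileAttributes attr = Files.readAttributes(path, BasicFileAttributes.class);
            sink += attr.size();
        }
        long t3 = System.nanoTime();

        System.out.println("File.length()         : " + (t1 - t0) + " ns");
        System.out.println("Files.size()          : " + (t2 - t1) + " ns");
        System.out.println("Files.readAttributes(): " + (t3 - t2) + " ns");
        System.out.println("(ignore) " + sink);
    }
}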
To my surprise, file.length() is the fastest, even though it has to create a File object instead of using the newer Path. I do not know whether it reads only the file system metadata or the file contents. So my question is:
What is the fastest, recommended way to obtain file sizes in recent Java versions (9/10/11)?
EDIT
I do not think these details add anything to the question. Basically, the benchmark output reads like this:
Length: 49852 with previous instantiation: 84676
Files: 3451537 with previous instantiation: 5722015
Length: 48019 with previous instantiation: 79910
Length: 47653 with previous instantiation: 86875
Files: 83576 with previous instantiation: 125730
BasicFileAttr: 333571 with previous instantiation: 366928
.....
Length is quite consistent. Files is noticeably slow on the first call, but it must cache something since later calls are faster (though still slower than Length). This is what other people observed in some of the links I reference above. BasicFileAttr was my hope, but it is still slow.
I am asking what is recommended in modern Java versions, and I considered 9/10/11 as "modern". It is not a dependency, nor a limitation, but I suppose Java 11 should provide better means to get file sizes than Java 5. If Java 8 already introduced the fastest way, that is OK.
It is not premature optimisation; at the moment I am optimising a CRC check with an initial size check, because comparing sizes should be much faster and, in theory, does not require reading the file contents. So I can directly use the "old" length() method, and all I am asking is what the new advances are in this respect in modern Java, since the new methods are apparently not as fast as the old ones.
So far I managed to do something with byte streams: read the original file and write a new file while omitting the desired bytes (and then finish by deleting/renaming the files so that there's only one left).
I'd like to know if there's a way to directly modify the bytes without having to manipulate more than one file. The reason is that this has to be performed when memory is low and the file is too big, so cloning the file before trimming it may not be the best option.
I'd like to know if there's a way to directly modify the bytes without having to manipulate more than one file.
There isn't a SAFE way to do it.
The unsafe way to do it involves (for example) mapping the file using a MappedByteBuffer, and shuffling the bytes around.
But the problem is that if something goes wrong while you are doing this, you are liable to end up with a corrupted file.
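For illustration, the unsafe in-place edit could be sketched like this; it is a FileChannel variant of the same idea (a MappedByteBuffer could be used for the shuffle instead). The method name, buffer size, and offsets are made up, and a crash part-way through the loop leaves the file corrupted, which is exactly the problem described above.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class InPlaceRemovalSketch {
    // Remove `count` bytes starting at `start` by shifting the rest of the file
    // left in place, one small buffer at a time, then truncating. Only 64 KB of
    // RAM is used, but an interruption mid-loop corrupts the file.
    static void removeBytesInPlace(Path file, long start, long count) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            long size = ch.size();
            ByteBuffer buf = ByteBuffer.allocate(64 * 1024);
            long readPos = start + count;    // first byte to keep
            long writePos = start;           // where it should end up
            while (readPos < size) {
                buf.clear();
                int n = ch.read(buf, readPos);
                if (n <= 0) break;
                buf.flip();
                while (buf.hasRemaining()) {
                    writePos += ch.write(buf, writePos);
                }
                readPos += n;
            }
            ch.truncate(size - count);       // drop the now-duplicated tail
        }
    }
}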
Therefore, if the user asks to perform this operation when the device's memory is too full to hold a second copy of the file, the best thing is to tell the user to "delete some files first".
The reason is that this has to be performed when memory is low and the file is too big, so cloning the file before trimming it may not be the best option.
If you are primarily worried about "memory" on the storage device, see above.
If you are worried about RAM, then @RealSkeptic's observation is correct: you shouldn't need to hold the entire file in RAM at the same time. You can read, modify, and write it a buffer at a time.
You can't remove bytes from the middle of a file without moving the rest of the file's contents, but you can replace bytes in place if that helps you.
I'm developing a basic download manager that can download a file over HTTP using multiple connections. At the end of the download, I have several temp files, each containing a part of the file.
I now want to merge them into a single file.
It's not hard to do: simply create an output stream and input streams, and pipe the inputs into the output in the right order.
But I was wondering: is there a way to do it more efficiently? From my understanding, what will happen here is that the JVM will read the inputs byte by byte and write the output byte by byte.
So basically I have :
- read byte from disk
- store byte in memory
- some CPU instructions will probably run and the byte will probably be copied into the CPU's cache
- write byte to the disk
I was wondering if there is a way to keep the whole process on the disk. I don't know if I'm being clear, but basically I want to tell the disk: "hey disk, take these files of yours and make one file out of them".
In short, I want to reduce CPU and memory usage to the lowest possible.
In theory it may be possible to do this operation at the file system level: you could append the block list from one inode to another without moving the data. This is not very practical though; most likely you would have to bypass your operating system and access the disk directly.
The next best thing may be using FileChannel.transferTo or transferFrom methods:
This method is potentially much more efficient than a simple loop that reads from this channel and writes to the target channel. Many operating systems can transfer bytes directly from the filesystem cache to the target channel without actually copying them.
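A minimal sketch of merging the downloaded parts with transferTo follows; the part list, target path, and method name are hypothetical, and the parts are assumed to already be in the right order.

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

class MergeSketch {
    // Append each downloaded part to the target file using FileChannel.transferTo,
    // which lets many operating systems move the data from the filesystem cache to
    // the output without copying it through a Java byte[].
    static void mergeParts(List<Path> parts, Path target) throws IOException {
        try (FileChannel out = FileChannel.open(target,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            for (Path part : parts) {
                try (FileChannel in = FileChannel.open(part, StandardOpenOption.READ)) {
                    long size = in.size();
                    long pos = 0;
                    while (pos < size) {   // transferTo may move fewer bytes than requested
                        pos += in.transferTo(pos, size - pos, out);
                    }
                }
            }
        }
    }
}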
You should also test reading and writing large blocks of bytes using streams or RandomAccessFile - it may still be faster than using channels. Here's a good article about testing sequential IO performance in Java.
I am using a ZipOutputStream to zip up a bunch of files that are a mix of already zipped formats as well as lots of large highly compressible formats like plain text.
Most of the already-zipped formats are large files, and it makes no sense to spend CPU and memory on recompressing them, since they never get smaller and on rare occasions get slightly larger.
I am trying to use .setMethod(ZipEntry.STORED) when I detect a pre-compressed file but it complains that I need to supply the size, compressedSize and crc for those files.
I can get it to work with the following approach, but it requires reading the file twice: once to calculate the CRC32, then again to actually copy the file into the ZipOutputStream.
// code that determines the value of method omitted for brevity
if (STORED == method) {
    fze.setMethod(STORED);
    fze.setCompressedSize(fe.attributes.size());
    // first read: compute the CRC32 of the file without writing it anywhere
    final HashingInputStream his = new HashingInputStream(Hashing.crc32(), fis);
    ByteStreams.copy(his, ByteStreams.nullOutputStream());
    fze.setCrc(his.hash().padToLong());
} else {
    fze.setMethod(DEFLATED);
}
zos.putNextEntry(fze);
// second read: copy the file contents into the ZipOutputStream
ByteStreams.copy(new FileInputStream(fe.path.toFile()), zos);
zos.closeEntry();
Is there a way provide this information without having to read the input stream twice?
Short Answer:
I could not determine a way to read the files only once and calculate the CRC with the standard library given the time I had to solve this problem.
I did find an optimization that decreased the time by about 50% on average.
I pre-calculate the CRCs of the files to be stored concurrently, with an ExecutorCompletionService limited to Runtime.getRuntime().availableProcessors(), and wait until they are done. The effectiveness of this varies with the number of files that need a CRC calculated: the more files, the more benefit.
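A rough sketch of that pre-calculation step is below, using the standard java.util.zip.CRC32 rather than Guava for brevity. This is not the answer's actual code; the method name, buffer size, and result map are made up.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.zip.CRC32;

class CrcPrecalcSketch {
    // Compute the CRC32 of every file that will be STORED, using one worker per
    // available processor, and wait for all results before writing the zip.
    static Map<Path, Long> precomputeCrcs(List<Path> storedFiles) throws Exception {
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        ExecutorCompletionService<Map.Entry<Path, Long>> ecs =
                new ExecutorCompletionService<>(pool);
        for (Path p : storedFiles) {
            ecs.submit(() -> {
                CRC32 crc = new CRC32();
                byte[] buf = new byte[64 * 1024];
                try (InputStream in = Files.newInputStream(p)) {
                    int n;
                    while ((n = in.read(buf)) > 0) {
                        crc.update(buf, 0, n);
                    }
                }
                return Map.entry(p, crc.getValue());
            });
        }
        Map<Path, Long> crcs = new HashMap<>();
        for (int i = 0; i < storedFiles.size(); i++) {
            Map.Entry<Path, Long> e = ecs.take().get();   // blocks until a CRC is ready
            crcs.put(e.getKey(), e.getValue());
        }
        pool.shutdown();
        return crcs;
    }
}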
Then, in .postVisitDirectories(), I wrap a ZipOutputStream around the PipedOutputStream half of a PipedInputStream/PipedOutputStream pair running on a temporary Thread. That converts the ZipOutputStream into an InputStream I can pass into the HttpRequest to upload the results to a remote server, while writing all the precalculated ZipEntry/Path objects serially.
This is good enough for now, to process the 300+GB of immediate needs, but when I get to the 10TB job I will look at addressing it and trying to find some more advantages without adding too much complexity.
If I come up with something substantially better time wise I will update this answer with the new implementation.
Long answer:
I ended up writing a clean-room ZipOutputStream that supports multipart zip files and intelligent compression levels vs. STORE, and that calculates the CRC as it reads and then writes out the metadata at the end of the stream.
Why ZipOutputStream.setLevel() swapping will not work:
The ZipOutputStream.setLevel(NO_COMPRESSION/DEFAULT_COMPRESSION) hack is not a viable approach. I did extensive tests on hundreds of gigabytes of data, thousands of folders and files, and the measurements were conclusive. It gains nothing over calculating the CRC for the STORED files versus compressing them at NO_COMPRESSION. It is actually slower by a large margin!
In my tests the files are on a network-mounted drive, so reading the already compressed files twice over the network, once to calculate the CRC and again to add them to the ZipOutputStream, was as fast or faster than just processing all the files once as DEFLATED and changing the .setLevel() on the ZipOutputStream.
There is no local filesystem caching going on with the network access. This is a worst-case scenario; processing files on a local disk will be much, much faster because of local filesystem caching.
So this hack is a naive approach based on false assumptions. It processes the data through the compression algorithm even at NO_COMPRESSION level, and the overhead is higher than reading the files twice.
I'm using memory mapped files in some Java code to quickly write to a 2G file. I'm mapping the entire file into memory. The issue I have with my solution is that if the file I'm writing to mysteriously disappears or the disk has some type of error, those errors aren't getting bubbled up to the Java code.
In fact, from the Java code, it looks as though my write completed successfully. Here's the unit test I created to simulate this type of failure:
File twoGigFile = new File("big.bin");
RandomAccessFile raf = new RandomAccessFile(twoGigFile, "rw");
raf.setLength(Integer.MAX_VALUE);
raf.seek(30000); // Totally arbitrary
raf.writeInt(42);
raf.writeInt(42);
MappedByteBuffer buf = raf.getChannel().map(MapMode.READ_WRITE, 0, Integer.MAX_VALUE);
buf.force();
buf.position(1000000); // Totally arbitrary
buf.putInt(0);
assertTrue(twoGigFile.delete());
buf.putInt(0);
raf.close();
This code runs without any errors at all. This is quite an issue for me. I can't seem to find anything out there that speaks about this type of issue. Does anyone know how to get memory mapped files to correctly throw exceptions? Or if there is another way to ensure that the data is actually written to the file?
I'm trying to avoid using a RandomAccessFile because they are much slower than memory mapped files. However, there might not be any other option.
You can't. To quote the JavaDoc:
All or part of a mapped byte buffer may become inaccessible at any time [...] An attempt to access an inaccessible region of a mapped byte buffer will not change the buffer's content and will cause an unspecified exception to be thrown either at the time of the access or at some later time.
And here's why: when you use a mapped buffer, you are changing memory, not the file. The fact that the memory happens to be backed by the file is irrelevant until the OS attempts to write the buffered blocks to disk, which is something managed entirely by the OS (i.e., your application will not know it's happening).
If you expect the file to disappear underneath you, then you'll have to use an alternate mechanism to see if this happens. One possibility is to occasionally touch the file using a RandomAccessFile, and catch the error that it will throw. Depending on your OS, even this may not be sufficient: on Linux, for example, a file exists for those programs that have an open handle to it, even if it has been deleted externally.
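A hedged sketch of that kind of periodic probe follows; the class and method names, the "rw" mode, and the health-check policy are made up for illustration.

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

class MappedFileProbe {
    // Call this periodically while using the mapped buffer: a deleted or failing
    // backing file shows up as a missing file or an IOException here, which the
    // mapped writes themselves will not reliably report.
    static boolean backingFileLooksHealthy(File file) {
        if (!file.exists()) {
            return false;                    // deleted externally (see the Linux caveat above)
        }
        try (RandomAccessFile probe = new RandomAccessFile(file, "rw")) {
            probe.seek(probe.length());      // touch the file without changing its contents
            return true;
        } catch (IOException e) {
            return false;                    // open or seek failed: treat the mapping as suspect
        }
    }
}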