I've got a question about MappedByteBuffer, specifically how it works internally. The way I understand it, the caching is done entirely by the operating system: if I read from the file (using a MappedByteBuffer), the OS reads whole pages from the hard drive and keeps them in RAM for faster access when they are needed again. This also provides a shared cache for multiple applications/processes which access the same file. Is this correct?
If so, how is it possible to invalidate this cache? Just reinitializing the mapped object shouldn't work. I have written an application which reads a lot from the hard drive. I need to do a few benchmarks, so I need to clear this cache when needed. I've tried to use "echo 3 > /proc/sys/vm/drop_caches", but this doesn't make a difference, so I think it is not working.
This also provides a shared cache for multiple applications/processes which access the same file. Is this correct?
This is how it works on Linux, Windows, and macOS. On other OSes it is probably the same.
If so, how is it possible to invalidate this cache?
Delete the file and it will no longer be valid.
I need to do a few benchmarks, so I need to clear this cache when needed.
That is what the OS is for. If you need to force the cache to be invalidated, it is tricky and entirely OS-dependent.
I've tried to use "echo 3 > /proc/sys/vm/drop_caches" but this doesn't make a difference so I think it is not working.
It may have no impact on your benchmark. I suggest you look at /proc/meminfo for lines such as:
Cached: 588104 kB
SwapCached: 264 kB
BTW, if you want to unmap a MappedByteBuffer, I do the following:
import java.nio.ByteBuffer;
import sun.misc.Cleaner;
import sun.nio.ch.DirectBuffer;

// Forcibly releases the memory mapping behind a direct or mapped buffer.
// Relies on internal JDK classes (sun.misc.Cleaner, sun.nio.ch.DirectBuffer).
public static void clean(ByteBuffer bb) {
    if (bb instanceof DirectBuffer) {
        Cleaner cl = ((DirectBuffer) bb).cleaner();
        if (cl != null)
            cl.clean();
    }
}
This works for direct ByteBuffers as well, but probably won't work in Java 9 as this interface will be removed.
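On Java 9 and later, a rough equivalent (my own hedged sketch, not part of the original answer) is to reach sun.misc.Unsafe.invokeCleaner via reflection; the method name clean9 and its error handling are illustrative:

import java.lang.reflect.Field;
import java.nio.ByteBuffer;

// Java 9+ sketch: unmap a direct or mapped buffer via sun.misc.Unsafe.invokeCleaner.
public static void clean9(ByteBuffer bb) throws ReflectiveOperationException {
    if (!bb.isDirect())
        return; // only direct/mapped buffers have a cleaner
    Class<?> unsafeClass = Class.forName("sun.misc.Unsafe");
    Field theUnsafe = unsafeClass.getDeclaredField("theUnsafe");
    theUnsafe.setAccessible(true);
    Object unsafe = theUnsafe.get(null);
    unsafeClass.getMethod("invokeCleaner", ByteBuffer.class).invoke(unsafe, bb);
}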
It's a known, sad issue (which is still unresolved in the JDK): http://bugs.java.com/view_bug.do?bug_id=4724038
But even though there's no public API for that there's a dangerous workaround (use at your own risk): https://community.oracle.com/message/9387222
An alternative is not to use huge MappedByteBuffers and hope that they eventually will be garbage-collected.
If what is needed is for different programs that map this file to each get their very own private MappedByteBuffer copy of it, MapMode.PRIVATE could help.
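A minimal sketch of such a copy-on-write mapping (the file name is illustrative):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// MapMode.PRIVATE: writes go to a private copy of the pages, never back to the file,
// so other processes mapping the same file are unaffected.
try (RandomAccessFile raf = new RandomAccessFile("shared.dat", "rw");
     FileChannel ch = raf.getChannel()) {
    MappedByteBuffer privateView = ch.map(FileChannel.MapMode.PRIVATE, 0, ch.size());
    privateView.put(0, (byte) 1); // visible only through this mapping
}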
Hope that helps.
I know this question has been widely discussed in different posts:
java get file size efficiently
How to get file size in Java
Get size of folder or file
https://www.mkyong.com/java/how-to-get-file-size-in-java/
My problem is that I need to obtain the sizes of a large number of files (regular files existing on a hard disk), and for this I need the solution with the best performance. My intuition is that it should be done through a method that reads the file system metadata directly, rather than obtaining the size by reading the whole file contents. It is difficult to tell from the documentation which specific method is used.
As stated in this page:
Files has the size() method to determine the size of the file. This is
the most recent API and it is recommended for new Java applications.
But this is apparently not the best advice in terms of performance. I have made measurements of different methods:
file.length();
Files.size(path);
BasicFileAttributes attr = Files.readAttributes(path, BasicFileAttributes.class);
attr.size();
And to my surprise, file.length() is the fastest, despite having to create a File object instead of using the newer Path. I do not know whether it reads only the file system metadata or the contents. So my question is:
What is the fastest, recommended way to obtain file sizes in recent Java versions (9/10/11)?
EDIT
I do not think these details add much to the question. Basically the benchmark output reads like this:
Length: 49852 with previous instanciation: 84676
Files: 3451537 with previous instanciation: 5722015
Length: 48019 with previous instanciation:: 79910
Length: 47653 with previous instanciation:: 86875
Files: 83576 with previous instanciation: 125730
BasicFileAttr: 333571 with previous instanciation:: 366928
.....
Length is quite consistent. Files is noticeably slow on the first call, but it must cache something, since later calls are faster (though still slower than Length). This is what other people observed in some of the links I referenced above. BasicFileAttr was my hope, but it is still slow.
I am asking what is recommended in modern Java versions, and I consider 9/10/11 to be "modern". It is not a dependency or a limitation, but I suppose Java 11 should provide better means to get file sizes than Java 5. If Java 8 already shipped the fastest way, that is OK.
It is not a premature optimisation; at the moment I am optimising a CRC check with an initial size check, because it should be much faster and does not, in theory, need to read the file contents. So I can simply use the "old" length() method, and all I am asking is what the new advances are in this respect in modern Java, since the new methods are apparently not as fast as the old ones.
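For reference, here is a minimal single-shot timing sketch of the three approaches (the class name and path are illustrative, not the original benchmark; one-off timings like this are dominated by class loading and warm-up, so a JMH benchmark would be more trustworthy):

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributes;

public class SizeTiming {
    public static void main(String[] args) throws IOException {
        Path path = Paths.get("/tmp/somefile.bin"); // illustrative path
        File file = path.toFile();

        long t0 = System.nanoTime();
        long viaLength = file.length();
        long t1 = System.nanoTime();
        long viaFilesSize = Files.size(path);
        long t2 = System.nanoTime();
        BasicFileAttributes attr = Files.readAttributes(path, BasicFileAttributes.class);
        long viaAttributes = attr.size();
        long t3 = System.nanoTime();

        System.out.printf("length: %d ns, Files.size: %d ns, readAttributes: %d ns%n",
                t1 - t0, t2 - t1, t3 - t2);
        System.out.println(viaLength + " " + viaFilesSize + " " + viaAttributes);
    }
}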
We have a file I/O bottleneck. We have a directory which contains lots of JPEG files, and we want to read them in in real time as a movie. Obviously this is not an ideal format, but this is a prototype object tracking system and there is no possibility to change the format as they are used elsewhere in the code.
From each file we build a frame object which basically means having a buffered image and an explicit bytebuffer containing all of the information from the image.
What is the best strategy for this? The data is on an SSD which in theory has read/write rates around 400 MB/s, but in practice it is reading no more than 20 files per second (3-4 MB/s) with the naive implementation:
bufferedImg = ImageIO.read(imageFile);[1]
byte[] data = ((DataBufferByte)bufferedImg.getRaster().getDataBuffer()).getData();[2]
imgBuf = ByteBuffer.wrap(data);
However, Java provides lots of possibilities for improving this:
(1) Channels, especially FileChannels.
(2) Gathering/Scattering.
(3) Direct Buffering
(4) Memory Mapped Buffers
(5) MultiThreading - use a bunch of callables to access many files simultaneously.
(6) Wrapping the files in a single large file.
(7) Other things I haven't thought of yet.
I would just like to know if anyone has extensively tested the different options, and knows what is optimal? I assume that (3) is a must, but I would still like to optimise the reading of a single file as far as possible, and am unsure of the best strategy.
Bonus question: in the code snippet above, when does the JVM actually 'hit the disk' and read in the contents of the file? Is it at [1], or is that just a file handle which 'points' to the object? It kind of makes sense to evaluate lazily, but I don't know how the implementation of the ImageIO class works.
ImageIO.read(imageFile)
Since it returns a BufferedImage, I assume it hits the disk at that call rather than just returning a file handle.
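As a hedged sketch of option (5) from the list above (the directory, the file filter, and the thread count are illustrative assumptions, not from the original code):

import java.awt.image.BufferedImage;
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.imageio.ImageIO;

public class ParallelFrameLoader {
    public static void main(String[] args) throws Exception {
        File[] files = new File("/data/frames").listFiles((dir, name) -> name.endsWith(".jpg"));
        ExecutorService pool = Executors.newFixedThreadPool(4); // tune to your CPU/SSD
        List<Future<BufferedImage>> futures = new ArrayList<>();
        for (File f : files) {
            futures.add(pool.submit(() -> ImageIO.read(f))); // each Callable decodes one file
        }
        for (Future<BufferedImage> future : futures) {
            BufferedImage img = future.get(); // results come back in submission order
            // build the frame object / ByteBuffer from img here
        }
        pool.shutdown();
    }
}

Whether this helps depends on where the bottleneck is: if JPEG decoding (CPU) dominates, parallel decoding wins; if the SSD is already saturated, it won't.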
When I want to make sure that the changes of a MappedByteBuffer are synced back to disc, do I need randomAccessFile.getFD().sync() or mappedByteBuffer.force(), or both? (In my simple tests neither of them seems to be required, which is confusing...)
Does somebody have an idea about the actual underlying operations, or could at least explain the differences, if any?
First, FileDescriptor.sync is equivalent to FileChannel.force (both call the POSIX fsync function).
Second, in the book "Java NIO" by Ron Hitchens (via Google Books), the chapter about MappedByteBuffer says:
MappedByteBuffer.force() is similar to the method of the same name in the FileChannel class. It forces any changes made to the mapped buffer to be flushed out to permanent disc storage. When updating a file through MappedByteBuffer object, you should always use MappedByteBuffer.force() rather than FileChannel.force(). The channel object may not be aware of all file updates made through the mapped buffer. MappedByteBuffer doesn't give you the option of not flushing file metadata - it's always flushed too. Note, that the same considerations regarding nonlocal filesystems apply here as they do for FileChannel.force
So, yes. You need to call MappedByteBuffer.force!
BUT then I found this bug report, which indicates that both calls could still be necessary, at least on Windows.
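A minimal sketch that does both, following the belt-and-braces reading of that bug report (the file name and size are illustrative):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Write through the mapping, then flush both the mapped pages and the file descriptor.
try (RandomAccessFile raf = new RandomAccessFile("data.bin", "rw");
     FileChannel ch = raf.getChannel()) {
    MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
    buf.putInt(0, 42);
    buf.force();        // flush the dirty mapped pages to the file
    raf.getFD().sync(); // additionally fsync the descriptor, per the bug report above
}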
Tim Bray's article "Saving Data Safely" left me with open questions. Today, it's over a month old and I haven't seen any follow-up on it, so I decided to address the topic here.
One point of the article is that FileDescriptor.sync() should be called to be on the safe side when using FileOutputStream. At first I was very irritated, because I have never seen any Java code doing a sync in the 12 years I have been doing Java, especially since working with files is a pretty basic thing. Also, the standard JavaDoc of FileOutputStream never hinted at syncing (Java 1.0 - 6). After some research, I figured ext4 may actually be the first mainstream file system requiring syncing. (Are there other file systems where explicit syncing is advised?)
I appreciate some general thoughts on the matter, but I also have some specific questions:
When will Android do the sync to the file system? This could be periodic and additionally based on life cycle events (e.g. an app's process goes to the background).
Does FileDescriptor.sync() take care of syncing the metadata, that is, syncing the directory entry of the changed file? Compare to FileChannel.force().
Usually, one does not directly write into the FileOutputStream. Here's my solution (do you agree?):
FileOutputStream fileOut = ctx.openFileOutput(file, Context.MODE_PRIVATE);
BufferedOutputStream out = new BufferedOutputStream(fileOut);
try {
    out.write(something);
    out.flush();
    fileOut.getFD().sync();
} finally {
    out.close();
}
Android will do the sync when it needs to, such as when the screen turns off or the device shuts down. If you are just looking at "normal" operation, an explicit sync by applications is never needed.
The problem comes when the user pulls the battery out of their device (or does a hard reset of the kernel), and you want to ensure you don't lose any data.
So the first thing to realize: the issue is what happens when power is suddenly lost, so a clean shutdown cannot happen, and the question is what state persistent storage will be in at that point.
If you are just writing a single, independent new file, it doesn't really matter what you do. The user could have pulled the battery while you were in the middle of writing, right before you started writing, etc. If you don't sync, it just means there is a longer window after you finish writing during which pulling the battery will lose the data.
The big concern here is when you want to update a file. In that case, when you next read the file you want to have either the previous contents, or the new contents. You don't want to get something half-way written, or lose the data.
This is often done by writing the data to a new file and then switching to it from the old file. Prior to ext4 you knew that, once you had finished writing a file, further operations on other files would not hit the disk before the writes to that file, so you could safely delete the previous file or otherwise perform operations that depend on your new file being fully written.
However, now if you write the new file, then delete the old one, and the battery is pulled, on the next boot you may see that the old file was deleted and the new file created, but the contents of the new file are not complete. By doing the sync, you ensure that the new file is completely written at that point, so you can make further changes (such as deleting the old file) that depend on that state.
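A common way to express that write-then-switch pattern (the file names and payload are illustrative; ctx is the Context from the question):

import java.io.File;
import java.io.FileOutputStream;

byte[] newContents = "new state".getBytes(); // illustrative payload
File tmp = new File(ctx.getFilesDir(), "state.json.tmp");
File target = new File(ctx.getFilesDir(), "state.json");
FileOutputStream fos = new FileOutputStream(tmp);
try {
    fos.write(newContents);
    fos.getFD().sync(); // make sure the new contents are on disk first
} finally {
    fos.close();
}
if (!tmp.renameTo(target)) { // rename is atomic on POSIX file systems
    // handle the failure, e.g. keep the old file and retry later
}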
fileOut.getFD().sync(); should be in the finally clause, before the close().
sync() is way more important than close() when it comes to durability.
So, every time you want to 'finish' working on a file, you should sync() it before close()ing it.
POSIX does not guarantee that pending writes will be written to disk when you issue a close().
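A hedged sketch of that ordering, reusing the identifiers from the question (ctx, file, and something come from the original snippet):

FileOutputStream fileOut = ctx.openFileOutput(file, Context.MODE_PRIVATE);
BufferedOutputStream out = new BufferedOutputStream(fileOut);
try {
    out.write(something);
} finally {
    out.flush();            // push buffered bytes down to the FileOutputStream
    fileOut.getFD().sync(); // ask the kernel to flush them to stable storage
    out.close();            // then release the descriptor
}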
When I list the files of a directory that contains 300,000 files with Java, an out-of-memory error occurs.
String[] fileNames = file.list();
What I want is a way to list all files of a directory incrementally, no matter how many files are in that specific directory, without hitting the "out of memory" problem with the default 64M heap limit.
I have Googled for a while and cannot find such a way in pure Java.
Please help me!!
Note, JNI is a possible solution, but I hate JNI.
I know you said "with the default 64M heap limit", but let's look at the facts - you want to hold a (potentially) large number of items in memory, using the mechanisms made available to you by Java. So, unless there is some dire reason that you can't, I would say increasing the heap is the way to go.
Here is a link to the same discussion at JavaRanch: http://www.coderanch.com/t/381939/Java-General/java/iterate-over-files-directory
Edit, in response to comment: the reason I said he wants to hold a large number of items in memory is because this is the only mechanism Java provides for listing a directory without using the native interface or platform-specific mechanisms (and the OP said he wanted "pure Java").
The cleanest solution for you is the NIO file API introduced in Java 7, which lets you iterate lazily instead of building one huge array. With Files.walk (added in Java 8) it looks like this:
import java.nio.file.*;
import java.util.stream.Stream;

final Path p = FileSystems.getDefault().getPath("Yourpath");
try (Stream<Path> paths = Files.walk(p)) { // lazy, recursive walk; close to release OS resources
    paths.forEach(filePath -> {
        if (Files.isRegularFile(filePath)) {
            // Do something with filePath
        }
    });
}
You are a bit out of luck here. At the very least, 300k strings will need to be created. With an average length of 8-10 chars and 2 bytes per char, that's about 6 MB at a minimum. Add the object pointer overhead per string (8 bytes) and you run into your memory limit.
If you absolutely must have that many files in a single dir, which I would not recommend as your file system will have problems, your best bet is to run a native process (not JNI) via Runtime.exec. Keep in mind that you will tie yourself to the OS (ls vs dir). You will get the list of files as one large string and will be responsible for post-processing it into what you want.
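A rough sketch of that approach on a POSIX system, using ProcessBuilder (the modern wrapper around Runtime.exec); the command and directory are illustrative, and on Windows you would spawn cmd /c dir /b instead:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Stream the output of "ls -1" line by line so nothing large accumulates in memory.
Process ls = new ProcessBuilder("ls", "-1", "/some/huge/dir").start();
try (BufferedReader reader = new BufferedReader(new InputStreamReader(ls.getInputStream()))) {
    String name;
    while ((name = reader.readLine()) != null) {
        // process one file name at a time
    }
}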
Hope this helps.
Having 300,000 files in a dir is not a good idea - AFAIK file systems are not good at having that many entries in a single directory. Interesting question, though.
EDIT: THE FOLLOWING DOES NOT HELP, see comments.
I think you could use a FileFilter, reject all files, and process them in the filter.
new File("c:/").listFiles(new FileFilter() {
    @Override
    public boolean accept(File pathname) {
        processFile(pathname); // handle each file as it is encountered
        return false;          // reject everything so the returned array stays empty
    }
});
If you can write your code in Java 7 or later, then the following is a good option.
Files.newDirectoryStream(Path dir)
Here is the java doc for the API.
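For instance, a minimal usage sketch (the directory path and the per-entry handling are placeholders):

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// The stream produces entries lazily, so memory use stays bounded even for huge directories.
try (DirectoryStream<Path> stream = Files.newDirectoryStream(Paths.get("/some/dir"))) {
    for (Path entry : stream) {
        System.out.println(entry.getFileName()); // or any other per-file processing
    }
} catch (IOException e) {
    e.printStackTrace();
}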
Hope this helps.