Efficient way to move directories recursively and merge them in Java

I am looking for the most efficient way to move a directory recursively in Java. At the moment, I am using Apache commons-io as shown in the code below. (If the destDir exists and contains part of the files, I would like those to be overwritten and the nested directory structures to be merged).
FileUtils.copyDirectoryToDirectory(srcDir, destDir);
FileUtils.deleteDirectory(srcDir);
While this does the trick, in my opinion, it isn't efficient enough. There are at least two issues that come to mind:
You will need to have twice as much space.
If this is an SSD, copying the data over to another part of the drive and then erasing the old data will eventually have an impact on the hardware, as it will in effect shorten the drive's life.
What is the best approach to do this?
As per my understanding, commons-io doesn't seem to use the new Java 7/8 features available in Files. On the other hand, I was unable to get Files.move(...) to work if the destDir exists (by "get it to work" I mean have it merge the directory structures -- it complains that the destDir exists).
Regarding failures to move (please correct me if I am wrong):
As far as I understand, an atomic move is one that only succeeds if all files are moved at once. If I understand this correctly, that means copying first and then deleting, which is not what I'm looking for.
If a certain path/file cannot be moved, then the operation should cease and throw an exception, reporting the source path it had reached.
Please, note that I am not limiting myself to using the commons-io library. I am open to suggestions. I am using Java 8.

This is just an answer to the "what needs to happen to the filesystem" part of the question, not how to do it with Java.
Even if you did want to call out to an external tool, Unix mv is not like Windows Explorer. Same-name directories don't merge. So you will need to implement that yourself, or find a library function that does. There is no single Unix system call that does the whole recursive operation (let alone atomically), so it's something either your code or a library function has to do.
If you need to atomically cut from one version of a tree to another, you need to build a new tree. The files can be hard-links to the old version. i.e. do the equivalent of
cp -al dir/ new
rsync -a /path/to/some/stuff/ new/
# or maybe something smarter / custom that renames instead of copies files.
# your sanity check here
mv dir old &&
mv new dir && # see below for how to make this properly atomic
rm -rf old
This leaves a window where dir doesn't exist. To solve this, add a level of indirection, by making dir a symlink. Symlinks can be replaced atomically with mv (but not ln -sf). So in Java, you want something that will end up doing a rename system call, not an unlink / rename.
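A rough sketch of that symlink swap in Java (assuming a POSIX filesystem where rename(2) atomically replaces an existing symlink; the paths and method name are illustrative):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Atomically repoint "link" at "newTarget" by building a temporary symlink
// and renaming it over the old one (a rename, not unlink + create).
static void swapSymlink(Path link, Path newTarget) throws IOException {
    Path tmp = link.resolveSibling(link.getFileName() + ".tmp");
    Files.deleteIfExists(tmp);
    Files.createSymbolicLink(tmp, newTarget);
    // On POSIX filesystems ATOMIC_MOVE maps to rename(2), which replaces the
    // existing symlink in a single step; there is no window where "link" is missing.
    Files.move(tmp, link, StandardCopyOption.ATOMIC_MOVE);
}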
Unless you have a boatload of extremely small files (under 100 bytes), the directory metadata operations of building a hardlink farm are much cheaper than a full copy of a directory tree. The file data will stay put (and never even be read); the directory data will be a fresh copy. The file metadata (inodes) will be written for all files: once to update the ctime and link count when creating the hardlink farm, and again when removing the old tree, leaving the files with their original link count.
If you're running on a recent Linux kernel, there is a newer (2013) system call, renameat2, that can exchange two paths atomically. This avoids the symlink level of indirection. Using a Linux-only system call from Java is going to be more trouble than it's worth, though, since symlinks are easy.

I am answering my own question, as I ended up writing my own implementation.
What I didn't like about the implementations of:
Apache Commons IO
Guava
Springframework
for moving files was that all of them first copy the directories and files and then delete them (as far as I checked, September 2015). They all seem to be stuck with methods from JDK 1.6.
My solution isn't atomic. It handles the moving by walking the directory structure and performing the moves file by file. I am using the new methods from JDK 1.7. It does the job for me, and I'm sure other people who want to do the same would otherwise waste time figuring it out. I have therefore created a small GitHub project, which contains:
FileUtils#moveDirectory(Path srcPath, Path destPath)
An illustration of how to use it can be seen here.
If anybody has suggestions on how to improve it, or would like to add features, please feel free to open a pull request.

Traverse the source directory tree:
When meeting a directory, ensure the same directory exists in the target tree (and has the right permissions etc).
When meeting a file, rename it to the same name in the corresponding directory in the target tree.
When leaving a directory, ensure it is empty and delete it.
Consider carefully how any error should be handled.
Note that you might also simply call out to "rsync" if it is available on your system.
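For illustration, here is a minimal sketch of that traversal with Files.walkFileTree (Java 7+; error handling is kept deliberately simple, and the names are illustrative):

import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.BasicFileAttributes;

public final class MoveMerge {

    // Moves srcDir into destDir, merging directories and overwriting existing files.
    static void moveAndMerge(final Path srcDir, final Path destDir) throws IOException {
        Files.walkFileTree(srcDir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) throws IOException {
                // Ensure the corresponding directory exists in the target tree.
                Files.createDirectories(destDir.resolve(srcDir.relativize(dir)));
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                // Rename the file into the corresponding target directory (a plain
                // rename when source and target are on the same filesystem).
                Files.move(file, destDir.resolve(srcDir.relativize(file)), StandardCopyOption.REPLACE_EXISTING);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
                if (exc != null) {
                    throw exc;
                }
                // The source directory should now be empty; remove it.
                Files.delete(dir);
                return FileVisitResult.CONTINUE;
            }
        });
    }
}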

Related

Which is the correct way to delete a file so that it is not recoverable?

Currently I am using file.delete(), but this is flagged as a security risk because files deleted like this can be recovered by different means. So please provide me a correct way to delete a file. The security risk reported here comes from a testing tool called Quixxi, which checks for vulnerabilities in the app.
The reason a "deleted" file is recoverable is because a delete operation simply unlinks the file in the filesystem, so the directory no longer considers that file part of it. The contents on disk (or whatever storage) still exist on that device.
If you want to guarantee the contents can never be recovered, you have to overwrite the contents first. There are no built-in functions to do this - you'd have to find a library or write the code yourself. Typically you'd write something like all 0s over the file (make sure to flush to media), write all 1s, write a pattern of 01 repeating, 10 repeating, something like that. After you've written with garbage patterns to media (flush) a few times, then you issue the delete.
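A minimal sketch of that overwrite-then-delete idea (a single pass of zeroes; as the next answer points out, this is best-effort only and does not guarantee the original sectors are actually overwritten):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Best-effort: overwrite the file's contents with zeroes, flush, then delete.
// Filesystems and SSD controllers may still remap the writes elsewhere.
static void overwriteAndDelete(Path file) throws IOException {
    long size = Files.size(file);
    try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
        ByteBuffer zeros = ByteBuffer.allocate(8192); // freshly allocated buffers are zero-filled
        long written = 0;
        while (written < size) {
            zeros.clear();
            zeros.limit((int) Math.min(zeros.capacity(), size - written));
            written += channel.write(zeros, written);
        }
        channel.force(true); // flush data (and metadata) to the storage device
    }
    Files.delete(file);
}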
Not possible in JRE, unfortunately. The JVM is not designed for that, and you need OS-dependent utilities.
The answer by user1676075 contains a mistake. Let's go by steps.
As pointed out already, Java's File.delete method only unlinks the file leaving its contents on disk. It actually invokes the underlying OS APIs to perform this unlink operation.
The problem occurs when you want to overwrite contents in Java.
Java can open a file for overwrite, but will leverage OS utils to do so. And the OS will likely:
Unlink the allocated space on disk
Link the file to a new free area of disk
The result is that you are now writing tons of zeroes... somewhere else!!!
And even if you managed to write zeroes on the same sectors used by the original file, Gutmann method exists for a reason. Gutmann utilities require root/Administrator (Super User) permissions and direct DMA access to precisely control where the writes have to occur.
And with SSDs, things change. Actually, it might get easier! At this point, I should provide a source for SSDs having a CLEAR instruction to replace a sector with zeroes, and for privacy-savvy disk controllers doing that. But maybe pretend you have read nothing.
This will be a sufficient answer for now, because we have demonstrated that there is no out-of-the-box and straightforward way to securely clear a file in Java.
What Java does allow, via the Java Native Interface (please also see Java Native Access), is calling native code from Java. So, have you got your Gutmann tool in C++ ready? Are you running as root? You can write code to invoke Gutmann-style erasure from Java, but that's a whole other point.
Never tried it, but it is surely feasible.

How to watch a complete file system for changes in Java?

Problem description
I would like to watch a complete file system for changes. I'm talking about watching changes in a directory recursively. So, when watching a directory (or a whole file system) all changes in sub-directories need to be captured too. The application needs to be able to track all changes by getting notified.
Java's WatchService isn't suitable
Java already has a WatchService feature, which allows you to monitor a directory for changes. The problem, however, is that this isn't recursive as far as I know, so you can't use it to monitor all changes under the root directory of a file system.
Watching all sub-directories explicitly
A solution I've thought of would be to register each directory inside the specified root directory explicitly. The problem with this, however, is that walking through and registering these directories is very resource-intensive on a system with more than a million sub-directories. This is because the system would need to go through the whole file system recursively just to register all the directories in the first place. The performance impact of this approach would be too big, if it's even possible without crashing the application.
Logical solution
I would assume an operating system fires some sort of event when anything on the file system is changed, which an application is able to listen to. However, I have not found anything like this yet. This would allow the application to listen to all changes without the need to register all sub-directories explicitly, so the performance impact of such a method would be minimal.
Question
Is watching a whole file system, or watching a directory recursively possible in Java, and how would this be achieved?
The question should be split into several:
How to track file events across the disk on certain OS
How to use this mechanism in Java
The answer to the first question is that the approaches differ. On Windows, there are Windows API functions that let you do this (and the well-known FileSystemWatcher class in the .NET Framework is a kind of wrapper around this API function set). The more robust method on Windows is to create or use a pre-created file system filter driver. On Linux there is inotify. On Mac OS X there are several approaches (there was a question on this topic somewhere around), none of them universal or always available.
Also all approaches except a filesystem filter driver are good only for being notified after the event happens, but they don't let you intercept and deny the request (AFAIK, I can be mistaken here).
As for the second question, there seems to be no universal solution that would cover all or most variants that I mentioned above. You would need to first choose the mechanism for each OS, then find some wrappers for Java to use those mechanisms.
Here is an example of how to watch a directory (or tree) for changes to files:
https://github.com/syncany/syncany/blob/59cf87c72de4322c737f0073ce8a7ddd992fd898/syncany-lib/src/main/java/org/syncany/operations/watch/RecursiveWatcher.java
You can even filter out directories that you don't want to watch.
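For reference, a stripped-down sketch of the same idea: register every directory under a root with one WatchService (this is the explicit registration the question calls expensive for huge trees; newly created sub-directories would still need to be registered as events arrive, which is not shown):

import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.HashMap;
import java.util.Map;

import static java.nio.file.StandardWatchEventKinds.ENTRY_CREATE;
import static java.nio.file.StandardWatchEventKinds.ENTRY_DELETE;
import static java.nio.file.StandardWatchEventKinds.ENTRY_MODIFY;

// Walks the tree once and registers each directory with the WatchService.
static Map<WatchKey, Path> registerTree(final WatchService watcher, Path root) throws IOException {
    final Map<WatchKey, Path> keys = new HashMap<>();
    Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
        @Override
        public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) throws IOException {
            WatchKey key = dir.register(watcher, ENTRY_CREATE, ENTRY_MODIFY, ENTRY_DELETE);
            keys.put(key, dir); // remember which directory each key belongs to
            return FileVisitResult.CONTINUE;
        }
    });
    return keys;
}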

Modification in symbolic link

How do I identify which link modified the target file when multiple symbolic links exist for a single file, using Java? I was unable to find out which link modified the target file.
Example: D:\sample.txt and D:\folder1\sample.txt are two links. The target file is located at E:\sample.txt.
Now, how do I identify whether D:\sample.txt or D:\folder1\sample.txt modified E:\sample.txt?
How to identify which link modified the target file when multiple symbolic links exist for a single file using Java?
It is not possible.
It is not possible in any programming language.
This functionality would have to be supported by the operating system, and no operating system I've ever come across does.
There are heuristics (using timestamps) that will probably work "most of the time", but in each case there are circumstances under which the heuristic will give no answer or even the wrong answer. Here are some of the confounding issues:
With simple timestamp heuristics:
it won't work if either of the symlinks is on a read-only file system, or a file system where access times are not recorded (e.g. depending on mount options), and
it won't work if a file read occurs on the symlink after the last file write.
When you add a watcher:
it won't work if you weren't "watching" at the time (duh!), and
it won't work if you have too many watcher events ... and you can't keep up.
(Besides, I don't think you can get events on the use of a symlink. So you would still need to check the symlink access timestamps anyway. And that means that read-only file systems, etc are a problem here too.)
And then there are scenarios like:
both symlinks are used to write the file,
you don't know about all of the symlinks, or
the symlink used for writing has been deleted or "touched".
These are probably beyond the scope of the OP's use-case. But they are relevant to the general question as set out by the OP's first sentence.
Maybe you can do that using Files.readAttributes(). The below works on Linux, since when you "use" a symlink under Linux, its last access time is modified. No idea about Windows; you'll have to test.
If symlink1 is a Path to your first symlink and symlink2 a Path to your second symlink, and realFile a Path to your real file, then you can retrieve FileTime objects of the last access time for both symlinks and last modification time of the file using:
FileTime link1Access = Files.readAttributes(symlink1, BasicFileAttributes.class).lastAccessTime();
FileTime link2Access = Files.readAttributes(symlink2, BasicFileAttributes.class).lastAccessTime();
FileTime fileModified = Files.readAttributes(realFile, BasicFileAttributes.class).lastModifiedTime();
Since FileTime is Comparable, you may spot which symlink was used, but this is NOT a guarantee.
Explanation: if someone uses symlink1 to modify realFile, then the access time of symlink1 will be modified and the modification time of realFile will be modified. If the last access time of symlink1 is GREATER than the last access time of symlink2, then there is a possibility that symlink1 was used for this operation; on the other hand, if the last access time of symlink2 is greater and the last modification time of realFile is lesser, then you are sure that symlink2 was not used for this purpose.
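For instance, using the FileTime values read above, the comparison might look like this (heuristic only, as stressed throughout this answer):

// Heuristic: a symlink is a plausible candidate only if it was accessed at or after
// the file's last modification, and more recently than the other symlink.
if (link1Access.compareTo(fileModified) >= 0 && link1Access.compareTo(link2Access) > 0) {
    System.out.println("symlink1 may have been used -- not a guarantee");
} else if (link2Access.compareTo(fileModified) >= 0 && link2Access.compareTo(link1Access) > 0) {
    System.out.println("symlink2 may have been used -- not a guarantee");
}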
But again there is no REAL guarantee. Those are only heuristics!
You should also have a look at using a WatchService in order to watch for modifications on the real file; this would make the heuristics above even more precise. But again, no guarantee.

Lock future file

So I have a Samba file server on which my Java app needs to write some files. The thing is that there is also another php application (if a php script is even considered an application) that is aggressively polling the same directory for new files.
Sometimes, the php script picks up the file before my Java app is done writing it completely to disk. Here is a little bit of ascii art to help visualize what I currently have (but which doesn't work):
Samba share
/foo (my java app drops file here)
/bar (the directory that the php is polling)
What I'm currently doing is, when the file meets some criteria, it's being moved to /bar and then picked up by the php for more processing. I've tried different things such as setting the file non-writable and non-readable before calling renameTo.
I've looked a little bit at FileLocks but it doesn't seem to be able to lock future files. So I am wondering what kind of possibilities I have here? What could I use to lock the file from being picked up before it's fully written, without touching the php (because, well, it's php and I don't really have the right to modify it right now)?
Thanks
Edit 1
I've got some insight on what the php script is really doing if it can help in any way.
It's reading the directory in a loop (using readdir, without sleeping).
As soon as it finds a filename other than "." and "..", it calls file_get_contents, and that's where it fails, because the file is not completely written to disk (or not even there, since the Java code might not even have had time to write it between the readdir and the file_get_contents).
Edit 2
This Java application is replacing an old php script. When they implemented it, they had the same problem I'm having right now. They solved it by writing the new file in /bar/tmp (with file_put_contents) and then using rename to move it to /bar (it looks like rename is supposed to be atomic). And it's been working fine so far. I can't and won't believe that Java can't do something better than what php does...
I think this is because read locks are shared (multiple processes can apply read locks to the same file and read it together).
One approach you can take is to create a separate temporary lock file (e.g. /bar/file1.lock) while /bar/file1 hasn't finished being written, and delete the lock file as soon as the writing is finished.
Then alter the php code to ensure the file isn't being locked before it reads.
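On the Java side, the signalling is just creating and deleting a marker file; a rough sketch (method name and paths are illustrative, and this only helps if the php side checks for the .lock file, as noted above):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Create a marker file before writing, delete it once the write is complete.
// The reader must cooperate by skipping files whose .lock marker still exists.
static void writeWithLockMarker(Path target, byte[] content) throws IOException {
    Path lock = target.resolveSibling(target.getFileName() + ".lock");
    Files.createFile(lock);                 // signal: target is being written
    try {
        Files.write(target, content);       // the actual write
    } finally {
        Files.delete(lock);                 // signal: target is complete
    }
}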
You mentioned that you tried FileLock, but keep in mind the disclaimer in the javadoc for that method:
Whether or not a lock actually prevents another program from accessing
the content of the locked region is system-dependent and therefore
unspecified. The native file-locking facilities of some systems are
merely advisory, meaning that programs must cooperatively observe a
known locking protocol in order to guarantee data integrity.
You also mentioned you are using File.renameTo, which also has some caveats (mentioned in the javadoc):
Many aspects of the behavior of this method are inherently
platform-dependent: The rename operation might not be able to move a
file from one filesystem to another, it might not be atomic, and it
might not succeed
Instead of File.renameTo, try Files.move with the ATOMIC_MOVE option. You'll have to catch AtomicMoveNotSupportedException and possibly fall back to some alternative workaround in case an atomic move is not possible.
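A sketch of that suggestion (the fallback branch here is just one possible workaround, not part of the original advice):

import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Try an atomic move first; fall back to a plain move if the filesystem
// (or the source/target combination) does not support it.
static void moveIntoBar(Path src, Path dst) throws IOException {
    try {
        Files.move(src, dst, StandardCopyOption.ATOMIC_MOVE);
    } catch (AtomicMoveNotSupportedException e) {
        // Not atomic: the php side may still observe a partially visible file.
        Files.move(src, dst, StandardCopyOption.REPLACE_EXISTING);
    }
}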
You could create a hard link with Files.createLink(Paths.get("/bar/myFile"), Paths.get("/foo/myFile")) (the first argument is the link to create, the second the existing file), then delete the original directory entry (in this example, /foo/myFile).
Failing that, a simple workaround that doesn't require modification to the PHP is to use a shell command or system call to move the file from /foo to /bar. You could, for example, use ProcessBuilder to call mv, or perhaps call ln to create a symlink or hardlink in /bar. You might still have the same problem with mv if /foo and /bar are on different filesystems.
If you have root privileges on the server, you could also try implementing mandatory file locking. I found an example in C, but you could call the C program from Java or adapt the example to Java using JNA (or JNI if you want to punish yourself).

What Tool/Utility can I use to list deleted files on windows?

I am making a Java desktop application that is going to "Shred", "Wipe", or "More Permanently Delete" files. I can do the wiping, but first I have to find and access the deleted files.
Is there some tool or utility that I can use to access deleted files? I could restore them to a temporary location and then shred them. Or is there a way I can do this with Java or the command line?
How do I list files marked as deleted by the Windows delete process using Java or the Command Line?
The short answer is that you can't do it in Java.
The longer answer is that the only way you could do this in Java would be to write a lot of native code to:
access the disk at the disk-block level,
decode the file system data structures to locate deleted files and orphaned blocks that were once part of deleted files, and
zero the relevant blocks.
... while ...
making sure that the blocks haven't been reallocated to another (non-deleted) file, and
taking account of other running processes that may be creating, modifying and deleting files.
Doing all of this is really hard if you are implementing everything in C / C++, and even harder if you are doing it from Java. And if you screw it up, you could trash the PC's file system.
A better idea would be to find some existing tool / utility that does the job and use Runtime.exec(...) or equivalent to run it as a separate process.
(I'll leave it to someone else to suggest possible tools / utilities. The sysinternals sdelete tool doesn't appear to deal with files that have already been deleted.)
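For what it's worth, a minimal sketch of launching such an external tool from Java with ProcessBuilder (the tool name and arguments below are placeholders, not a recommendation of any particular utility):

import java.io.IOException;

// Run an external recovery/wiping tool as a separate process and wait for it to finish.
static int runExternalTool(String drive) throws IOException, InterruptedException {
    ProcessBuilder pb = new ProcessBuilder("some-recovery-tool.exe", "--list-deleted", drive);
    pb.inheritIO(); // forward the tool's output to this process's console
    Process process = pb.start();
    return process.waitFor();
}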
