What is the cost of the delete, rename, and move file operations? Which one is the fastest?
I want to use Java, and the files are maintained by the Linux operating system.
It is not possible to say which is faster in general, because the relative performance depends on a variety of factors. And it is probably irrelevant ... because they do different things and typically are not interchangeable.
However:
Rename and move are typically equivalent if the source and destination locations are in the same file system.
If move involves moving between file systems it is probably the most expensive. O(N) bytes must be copied.
Otherwise, delete is probably the most expensive. The OS needs to update the parent directory AND mark all of the disc blocks used by the file as free.
The actual costs also depend on the operating systems and the type of file system(s) involved, and (in some cases) on the size of the files involved - see above.
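To make the same-file-system case concrete, here is a minimal sketch using java.nio.file; the class name and helper method are just for illustration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class MoveVsDelete {
    // Renames a file within one directory (same file system), then deletes it.
    // The rename is typically a cheap metadata update; the delete also has to
    // free the file's data blocks. Returns true if both operations succeeded.
    static boolean renameThenDelete() {
        try {
            Path dir = Files.createTempDirectory("demo");
            Path src = Files.createFile(dir.resolve("a.txt"));

            // Same-file-system rename: ATOMIC_MOVE fails rather than silently
            // falling back to a copy+delete if the move cannot be done atomically.
            Path dst = Files.move(src, dir.resolve("b.txt"),
                    StandardCopyOption.ATOMIC_MOVE);
            boolean renamed = Files.exists(dst) && !Files.exists(src);

            Files.delete(dst); // removes the directory entry and frees the blocks
            Files.delete(dir);
            return renamed && !Files.exists(dst);
        } catch (IOException e) {
            return false;
        }
    }
}
```

If the destination had been on a different file system, `ATOMIC_MOVE` would throw an `AtomicMoveNotSupportedException`, which is exactly the expensive copy case described above.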
It depends on the implementation details of the file system. In most file systems it should be a constant-time, O(1), operation.
Renaming a file is basically just changing the path in a localized way, so it should be as fast as changing the path. Deleting really just means deleting a reference, so it should be fairly fast as well.
The only case where you should see a significant increase in cost is when copying the file or moving it to another partition/disk. These cases actually require the file system to copy the file block by block.
How long it actually takes will heavily depend on the file system you are using (ext3, ext4, FAT, ...) and of course on the speed of your hard disks and hard disk connections (i.e. your motherboard).
If you need a definitive answer on your question I don't think you could avoid benchmarking it yourself using your specific test setup.
Related
On a large file (here 35GB):
Files.deleteIfExists(Paths.get("large.csv"));
The deletion using Java takes >60s. Deleting with rm large.csv on the console takes just a moment.
Why? Can I speed up large file deletion from within java?
I would blame this on the operating system. On both Windows and Linux, Java simply calls a method provided by the OS-provided C native runtime libraries to delete the file.
(Check the OpenJDK source code.)
So why might it take a long time for the operating system to delete a large file?
A typical file system keeps a map of the disk blocks that are free versus in-use. If you are freeing a really large file, a large number of blocks are being freed, so a large number of bits in the free map need to be updated and written to disk.
A typical file system uses a tree-based index structure to map file offsets to disk blocks. If a file is large enough, the index structure may span multiple disk blocks. When a file is deleted, the entire index needs to be scanned to figure out all of the blocks containing data that need to be freed.
These costs are magnified if the file is badly fragmented, and the index blocks and free map blocks are widely scattered.
Deleting a file is typically done synchronously. At least, all of the disk blocks are marked as free before the syscall returns. (If you don't do that, the user is liable to complain that deleting files doesn't work.)
In short, when you delete a huge file, there is a lot of "disk" I/O to do. The operating system does this, not Java.
So why would deleting a file be faster from the command line?
One possible reason is that the rm command you are using is actually just moving the deleted file to a Trash folder. That is really a rename operation, which is much faster than a real delete.
Note: that's not the normal behavior of rm on Linux.
Another possible reason (on Linux) is that the index and free map blocks for the file you were deleting were in the buffer cache in one test scenario and not in the other. (If your machine has lots of spare RAM, Linux will cache disk blocks in RAM to improve performance. It is pretty effective.)
What's the best way of creating a large temporary file in Java, and being sure that it's on disk, not in RAM somewhere?
If I use
Path tempFile = Files.createTempFile("temp-file-name", ".tmp");
then it works fine for small files, but on my Linux machine, it ends up being stored in /tmp. On many Linux boxes, that's a tmpfs filesystem, backed by RAM, which will cause trouble if the file is large. The appropriate way of doing this on such a box is to put it in /var/tmp, but hard-coding that path doesn't seem very cross-platform to me.
Is there a good cross-platform way of creating a temporary file in Java and being sure that it's backed by disk and not by RAM?
There is no platform-independent way to determine free disk space. Actually, there is not even a good platform-dependent way; you can run into ZFS file systems (which may be compressing your data on the fly), directories that are being filled by other applications, or network shares that are simply lying to you.
I know of these options:
Assume that it is an operational concern. I.e. whoever uses the software should have an administrator who is aware of how much space is left on what device, and who expects to be able to explicitly configure the partition that should hold the data. I'd start considering this at several tens of GB, and prefer it at a few hundred GB.
Assume it's really a temporary file. Document that the application needs xxx GB of temporary space (whatever rough estimate you can give them - my application says "needs ca. 100 GB for every automatic update that you keep on disk").
Abuse the user cache for the file. The XDG standard has $XDG_CACHE_HOME for the cache; the cache directory is supposed to be nice and big (take a look at the ~/.cache/ of anybody using a Linux machine). On Windows, you'd simply use %TEMP% but that's okay because %TEMP% is supposed to be big anyway.
This gives the following strategy: Try environment variables, first XDG_CACHE_HOME (if it's nonempty, it's a Posix system with XDG conventions), then TMP (if it's nonempty, it's a Posix system and you don't have a better option than /tmp anyway), finally TEMP in case it's Windows.
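A sketch of that lookup order in Java follows; the class and method names are mine, and the final fallback to java.io.tmpdir is an assumption added to keep the example total:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DiskBackedTemp {
    // Picks a likely disk-backed temp directory following the strategy above:
    // XDG_CACHE_HOME first, then TMP, then TEMP. This is one reasonable
    // interpretation of the strategy, not a standard API.
    static Path pickTempDir() {
        String[] candidates = { System.getenv("XDG_CACHE_HOME"),
                                System.getenv("TMP"),
                                System.getenv("TEMP") };
        for (String c : candidates) {
            if (c != null && !c.isEmpty()) {
                return Paths.get(c);
            }
        }
        // Last resort: the JVM default (which may be tmpfs on Linux).
        return Paths.get(System.getProperty("java.io.tmpdir"));
    }

    // Creates the temp file in the chosen directory instead of the JVM default.
    static Path createLargeTempFile(String prefix, String suffix) throws IOException {
        return Files.createTempFile(pickTempDir(), prefix, suffix);
    }
}
```

Note that `Files.createTempFile(dir, prefix, suffix)` is the standard overload that takes an explicit parent directory, so only the directory selection here is custom.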
I am using a file and need to update a value in Java when the file is modified.
So I am thinking of checking the modified time using lastModified of the File class, and if it has changed, reading the file and updating a single property from it.
My doubt is: is lastModified as heavy as reading a single property from the file, or reading the whole file? My test results show almost the same timings for both.
So is it better to read the file and update the property every time, or is checking lastModified the better option in the long run?
Note: this operation is performed every minute.
Or is there any better option than polling lastModified to check whether the file has changed? I am using Java 6.
Because you are using Java 6, checking the modified date or file contents is your only option (there's another answer that discusses using the newer java.nio.file functionality, and if you have the option of moving to Java 7, you should really, really consider that).
To answer your original question:
You didn't specify the location of the file (i.e. is it on a local disk or a server somewhere else) - I'll respond assuming local disk, but if the file is on a different machine, network latencies and netbios/dfs/whatever-network-file-system-you-use will exacerbate the differences.
Checking the modified date on a file involves reading the metadata from disk. Checking the contents of the file requires reading the file contents from disk (if the file is small, this will be one read operation; if the file is larger, it could be multiple read operations).
Reading the content of the file will probably involve read/write lock checking. Generally speaking, checking the modified date on the file will not require read/write lock checking (depending on the file system, there may still be consistency locks occurring on the meta data disk page, but those are generally lighter weight than file locks).
If the file changes frequently (i.e. you actually expect it to change every minute), then checking the modified date is just overhead - you are going to read the file contents in most cases anyway. If the file doesn't change frequently, then there would definitely be an advantage to modified date checking if the file was large (and you had to read the entire file to get at your information).
If the file is small, and doesn't change frequently, then it's pretty much a wash. In most cases, the file contents and the file meta data are already going to be paged into RAM - so both operations are a relatively efficient check of contents in RAM.
I personally would do the modified date check just because it logically makes sense (and it protects you from the performance hit if the file size ever grows above one disk page) - but if the file changes frequently, then I'd just read the file contents. But really, either way is fine.
And that brings us to the unsolicited advice: my guess is that the performance of this operation isn't a big deal in the greater scheme of things. Even if it took 1000X longer than it does now, it probably still wouldn't impact your application's primary purpose/performance. So my real advice here is just write the code and move on - don't worry about its performance unless this becomes a bottleneck for your application.
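For the Java 6 route, a minimal polling helper built on File.lastModified might look like this (the class name is illustrative):

```java
import java.io.File;

public class ModifiedDateWatcher {
    private final File file;
    private long lastSeen;

    ModifiedDateWatcher(File file) {
        this.file = file;
        this.lastSeen = file.lastModified();
    }

    // Returns true (and remembers the new timestamp) if the file changed
    // since the previous call; re-read the property only in that case.
    boolean changedSinceLastCheck() {
        long now = file.lastModified();
        if (now != lastSeen) {
            lastSeen = now;
            return true;
        }
        return false;
    }
}
```

One caveat: lastModified has coarse granularity on some file systems (often one second), so two writes within the same tick can be missed; for a once-a-minute poll that is rarely a problem.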
Quoting from the Java Tutorials:
To implement this functionality, called file change notification, a program must be able to detect what is happening to the relevant directory on the file system. One way to do so is to poll the file system looking for changes, but this approach is inefficient. It does not scale to applications that have hundreds of open files or directories to monitor.
The java.nio.file package provides a file change notification API, called the Watch Service API. This API enables you to register a directory (or directories) with the watch service. When registering, you tell the service which types of events you are interested in: file creation, file deletion, or file modification. When the service detects an event of interest, it is forwarded to the registered process. The registered process has a thread (or a pool of threads) dedicated to watching for any events it has registered for. When an event comes in, it is handled as needed.
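A minimal sketch of registering a directory with the Watch Service (Java 7+, so only relevant if you can upgrade; the class and method names here are illustrative):

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

public class WatchDemo {
    // Registers the directory for create/modify events and blocks until
    // the first batch of events arrives, returning the signalled key.
    // A real application would loop, call key.pollEvents(), and then
    // key.reset() before waiting again; this sketch stops after one batch.
    static WatchKey watchOnce(Path dir) throws Exception {
        WatchService ws = FileSystems.getDefault().newWatchService();
        dir.register(ws, StandardWatchEventKinds.ENTRY_CREATE,
                         StandardWatchEventKinds.ENTRY_MODIFY);
        return ws.take(); // blocks until an event occurs in dir
    }
}
```

On Linux this is backed by inotify, so it avoids the polling overhead the tutorial warns about; on platforms without native support the JDK falls back to an internal polling implementation.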
Edit: Thanks to Kevin Day for pointing out in the comments that since you are using Java 6, this might not work for you. There is an alternative available in Apache Commons IO, but I have not worked with it, so you will have to check it yourself :)
I have an external disk with a billion files. If I mount the external disk on computer A, my program scans all the files' paths and saves them in a database table. After that, when I eject the external disk, that data will still remain in the table. The problem is, if some files are deleted on computer B and I mount the disk on computer A again, I must synchronize the database table on computer A. However, I don't want to scan all the files again, because that takes a lot of time and wastes a lot of memory. Is there any way to update the database table without scanning all the files, while minimizing the memory used?
Besides, in my case the memory limitation is more important than time. That is, I would rather save memory than time.
I think I can split the files into a lot of sections and use some specific function (maybe SHA1?) to check whether the files in each section have been deleted. However, I cannot figure out how to split the files into sections. Can anyone help me or give me better ideas?
If you don't have control over the file system on the disk you have no choice but to scan the file names on the entire disk. To list the files that have been deleted you could do something like this:
update files in database: set "seen on this scan" to false
for each file on disk do:
insert/update database, setting "seen on this scan" to true
done
deleted files = select from files where "seen on this scan" = false
A solution to the db performance problem could be accumulating the file names into a list of some kind and do a bulk insert/update whenever you reach, say, 1000 files.
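The mark-and-sweep pseudocode above can be sketched in Java, with a plain Map standing in for the database table (an assumption made to keep the example self-contained; in practice these would be SQL UPDATE/UPSERT/SELECT statements):

```java
import java.util.Collection;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class MarkAndSweep {
    // 'table' maps file path -> "seen on this scan" flag.
    // Resets all marks, marks (and upserts) every path found on disk,
    // then returns the paths that were never marked, i.e. deleted files.
    static Set<String> findDeleted(Map<String, Boolean> table,
                                   Collection<String> filesOnDisk) {
        for (String path : table.keySet()) {
            table.put(path, false);            // set "seen" to false
        }
        for (String path : filesOnDisk) {
            table.put(path, true);             // insert/update, mark as seen
        }
        Set<String> deleted = new TreeSet<String>();
        for (Map.Entry<String, Boolean> e : table.entrySet()) {
            if (!e.getValue()) {
                deleted.add(e.getKey());       // never seen => deleted on disk
            }
        }
        return deleted;
    }
}
```

Memory use stays bounded because only one path is in flight at a time; with a real database the disk listing can be streamed straight into batched upserts, as the answer suggests.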
As for directories with 1 billion files, you just need to replace the code that lists the files with something that wraps the C functions opendir and readdir. If I were you I wouldn't worry about it too much for now. No sane person has 1 billion files in one directory, because that sort of thing cripples file systems and common OS tools, so the risk is low and the solution is easy.
In theory, you could speed things up by checking "modified" timestamps on directories. If a directory has not been modified, then you don't need to check any of the files in that directory. Unfortunately, you do need to scan possible subdirectories, and finding them involves scanning the directory ... unless you've saved the directory tree structure.
And of course, this is moot if you've got a flat directory containing a billion files.
I imagine that you are assembling all of the filepaths in memory so that you can sort them before querying the database. (And sorting them is a GOOD idea ...) However there is an alternative to sorting in memory:
Write the filepaths to a file.
Use an external sort utility to sort the file into primary key order.
Read the sorted file, and perform batch queries against the database in key order.
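A sketch of the first two steps, delegating the sort to the system `sort` utility (this assumes a Unix-like environment with `sort` on the PATH; class and method names are mine):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ExternalSort {
    // Writes the paths to a temporary text file, then sorts it in place
    // with the external 'sort' utility, which handles files far larger
    // than available RAM. Returns the sorted file.
    static Path sortPaths(List<String> paths) throws IOException, InterruptedException {
        Path unsorted = Files.createTempFile("paths", ".txt");
        Files.write(unsorted, paths);

        Path sorted = Files.createTempFile("paths-sorted", ".txt");
        Process p = new ProcessBuilder(
                "sort", "-o", sorted.toString(), unsorted.toString())
                .inheritIO()
                .start();
        if (p.waitFor() != 0) {
            throw new IOException("external sort failed");
        }
        return sorted;
    }
}
```

The sorted file can then be read line by line and turned into batched, key-ordered database queries, which keeps both memory use and index thrashing low.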
(Do you REALLY have a billion files on a disc? That sounds like a bad design for your data store ...)
Do you have a list of what's deleted when the delete happens (or can you change whatever process does the deleting so that it creates one)? If so, you could keep an "I've been deleted" list with timestamps, and then pick up items from that list to synchronize only what's changed. Naturally, you would still want some kind of batch job to sync during a slow time on the server, but I think that could reduce the load.
Another option, depending on what is doing the deleting, is to have that process update the database(s) directly (if you have multiple nodes) when it deletes. This would introduce some coupling into the systems, but would be the most efficient way to do it.
The best ways, in my opinion, are some variation on the idea of messaging that a delete has occurred (even if that's just a file you write to somewhere with a list of recently deleted files), or some kind of direct callback mechanism, either through code or by having the delete process adjust the persistent data store the application uses directly.
Even with all this said, you would always need to have some kind of index synchronization or periodic sanity check on the indexes to be sure that everything is matched up correctly.
You could (and I would be shocked if you didn't have to, given the number of files you have) partition the file space into folders with, say, 5,000-10,000 files per folder, and then create a simple file that has a hash of the names of all the files in the folder. This would catch deletes, but I still think that a direct callback of some form when the delete occurs is a much better idea. If you have a monolithic folder with all this stuff, creating something to break it into separate folders (we used simple numbers under the main folder so we could go on ad nauseam) should speed everything up greatly; even if you only do this for new files and leave the old files in place as is, at least you could stop the bleeding on file retrieval.
In my opinion, since you are programmatically controlling an index of the files, you should really have the same program involved somehow (or notified) when changes occur at the time of change to the underlying file system, as opposed to allowing changes to happen and then looking through everything for updates. Naturally, to catch the outliers where this communication breaks down, you should also have synchronization code in there to actually check what is in the file system and update the index periodically (although this could and probably should be batched out of process to the main application).
If memory is important I would go for the operating-system facilities.
If you have ext4 I will presume you are on Linux (you can install find on other operating systems like Windows). If that is the case, you could use the native find command (this example covers the last day; you can of course remember the last scan time and adjust the argument to whatever you like):
find /directory_path -type f -mtime -1 -print
Of course this won't give you the deletes. If a heuristic algorithm works for you, you can create a thread that slowly goes through each file stored in your database (whatever you need to display first, then from newer to older) and checks that it still exists. This won't consume much memory. I reckon you won't be able to show a billion files to the user anyway.
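If you want to stay inside Java rather than shell out to find, a lazy Files.walk over the tree gives a rough equivalent of `find -type f -mtime -1` without holding every path in memory at once (the class and method names are mine):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class RecentFiles {
    // Lazily walks the tree and keeps only regular files modified after
    // the cutoff, so memory use scales with the matches, not the tree.
    static List<Path> modifiedSince(Path root, Instant cutoff) throws IOException {
        try (Stream<Path> s = Files.walk(root)) {
            return s.filter(Files::isRegularFile)
                    .filter(p -> {
                        try {
                            return Files.getLastModifiedTime(p)
                                        .toInstant().isAfter(cutoff);
                        } catch (IOException e) {
                            return false; // file vanished mid-walk; skip it
                        }
                    })
                    .collect(Collectors.toList());
        }
    }
}
```

Like find, this still has to visit every directory entry; it only avoids building the full list of a billion paths in the JVM heap.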
We have an interface with an external system in which we get flat files from them and process those files. At present we run a job a few times a day that checks if the file is at the ftp location and then processes if it exists.
I recently read that it is a bad idea to make use of file systems as a message broker which is why I am putting in this question. Can someone clarify if a situation like this one is a right fitment for the use of some other tool and if so which one?
Ours is a java based application.
The first question you should ask is "is it working?".
If the answer to that is yes, then you should be circumspect about change just because you read it was a bad idea. I've read that chocolate may be bad for you but I'm not giving it up :-)
There are potential problems that you can run into, such as files being deleted without your knowledge, or trying to process files that are only half-transferred (though there are ways to mitigate both of those, such as permissions in the former case, or the use of sentinel files or content checking in the latter case).
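One way to implement the sentinel-file mitigation mentioned above; the `.done` naming convention here is an assumption, not a standard:

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class SentinelCheck {
    // Only process a data file once a matching ".done" marker exists
    // alongside it, signalling that the sender finished the transfer.
    // The sender must create the marker *after* the data file is complete.
    static boolean readyToProcess(Path dataFile) {
        Path sentinel = dataFile.resolveSibling(dataFile.getFileName() + ".done");
        return Files.exists(dataFile) && Files.exists(sentinel);
    }
}
```

An equivalent trick is to have the sender upload under a temporary name and rename to the final name only when done, since a same-directory rename is atomic on most file systems.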
Myself, I would prefer a message queueing system such as IBM's MQ or JMS (since that's what they're built for, and they do make life a little easier) but, as per the second paragraph above, only if either:
problems appear or become evident with the current solution; or
you have some spare time and money lying around for unnecessary rework.
The last bullet needs expansion. While the work may be unnecessary (in terms of fixing a non-existent problem), that doesn't necessarily make it useless, especially if it can improve performance or security, or reduce the maintenance effort.
I would use a database to synchronize your files. Have a database table that points to the file locations. Put an entry into the database only when a file has been fully transferred; this ensures that you are picking up completed files. You can then poll the database for new entries instead of polling the file system, which makes for a very simple polling mechanism. If you want to be told as soon as a new file appears in the folder, then you would need a message queue.