Why does Files.deleteIfExists take so long for large files?

Why does Files.deleteIfExists take so long for large files? - java

On a large file (here 35GB):
Files.deleteIfExists(Path.get("large.csv"));
The deletion using java takes >60s. Deleting with rm large.csv on the console just a moment.
Why? Can I speed up large file deletion from within java?

I would blame this on the operating system. On both Windows and Linux, Java simply calls a method provided by the OS-provided C native runtime libraries to delete the file.
(Check the OpenJDK source code.)
So why might it take a long time for the operating system to delete a large file?
A typical file system keeps a map of the disk blocks that are free versus in-use. If you are freeing a really large file, a large number of blocks are being freed, so a large number of bits in the free map need to be updated and written to disk.
A typical file system uses a tree-based index structure to map file offsets to disk blocks. If a file is large enough, the index structure may span multiple disk blocks. When a file is deleted, the entire index needs to be scanned to figure all of the blocks containing data that need to be freed.
These costs are magnified if the file is badly fragmented, and the index blocks and free map blocks are widely scattered.
Deleting a file is typically done synchronously. At least, all of the disk blocks are marked as free before the syscall returns. (If you don't do that, the user is liable to complain that deleting files doesn't work.)
In short, when you delete a huge file, there is a lot of "disk" I/O to do. The operating system does this, not Java.
So why would deleting a file be faster from the command line?
One possible reason is that maybe the rm command you using is actually just moving the deleted file to a Trash folder. That is actually a rename operation, and it is much faster than a real delete.
Note: that's not the normal behavior of rm on Linux.
Another possible reason (on Linux) is that the index and free map blocks for the file that you were deleting were in the buffer cache in one test scenario and not in the other. (If your machine has lost of spare RAM, the Linux OS will cache disk blocks in RAM to improve performance. It is pretty effect.)

Related

Processing large number of text files in java

I am working on an application which has to read and process ~29K files (~500GB) everyday. The files will be in zipped format and available on a ftp.
What I have done: I plan to download and the files from ftp, unzip it and process using multi-threading, which has reduced the processing time significantly (when number of active threads are fixed to a smaller number). I've written some code and tested it for ~3.5K files(~32GB). Details here: https://stackoverflow.com/a/32247100/3737258
However, the estimated processing time, for ~29K files, still seems to be very high.
What I am looking for: Any suggestion/solution which could help me bring the processing time of ~29K files, ~500GB, to 3-4 hours.
Please note that, each files have to be read line by line and each line has to be written to a new file with some modification(some information removed and some new information be added).

You should profile your application and see where the current bottleneck is, and fix that. Proceed until you are at your desired speed or cannot optimize further.
For example:
Maybe you unzip to disk. This is slow, to do it in memory.
Maybe there is a load of garbage collection. See if you can re-use stuff
Maybe the network is the bottleneck.. etc.
You can, for example, use visualvm.

It's hard to provide you one solution for your issue, since it might be that you simply reached the hardware limit.
Some Ideas:
You can parallelize the process which is necessary to process the read information. There you could provide multiple read lines to one thread (out of a pool), which processes these sequentially
Use java.nio instead of java.io see: Java NIO FileChannel versus FileOutputstream performance / usefulness
Use a profiler
Instead of the profiler, simply write log messages and measure the
duration in multiple parts of your application
Optimize the Hardware (use SSD drives, expiriment with block size, filesystem, etc.)

If you are interested in parallel computing then please try Apache spark it is meant to do exactly what you are looking for.

Creating a large temporary file in a platform-agnostic way

What's the best way of creating a large temporary file in Java, and being sure that it's on disk, not in RAM somewhere?
If I use
Path tempFile = Files.createTempFile("temp-file-name", ".tmp");
then it works fine for small files, but on my Linux machine, it ends up being stored in /tmp. On many Linux boxes, that's a tmpfs filesystem, backed by RAM, which will cause trouble if the file is large. The appropriate way of doing this on such a box is to put it in /var/tmp, but hard-coding that path doesn't seem very cross-platform to me.
Is there a good cross-platform way of creating a temporary file in Java and being sure that it's backed by disk and not by RAM?

There is no platform-independent way to determine free disk space. Actually there is not even a good platform-dependent way; things that happen are zfs filesystems (which may be compressing your data on the fly), directories that are being filled by other applications, or network shares that are simply lying to you.
I know of these options:
Assume that it is an operating concern. I.e. whoever uses the software should have an administrator who is aware of how much space is left on what device, and who expects to be able to explicitly configure the partition that should hold the data. I'd start considering this at several tens of GB, and prefer this at a few 100 GBs.
Assume it's really a temporary file. Document that the application needs xxx GB of temporary space (whatever rough estimate you can give them - my application says "needs ca. 100 GB for every automatic update that you keep on disk").
Abuse the user cache for the file. The XDG standard has $XDG_CACHE_HOME for the cache; the cache directory is supposed to be nice and big (take a look at the ~/.cache/ of anybody using a Linux machine). On Windows, you'd simply use %TEMP% but that's okay because %TEMP% is supposed to be big anyway.
This gives the following strategy: Try environment variables, first XDG_CACHE_HOME (if it's nonempty, it's a Posix system with XDG conventions), then TMP (if it's nonempty, it's a Posix system and you don't have a better option than /tmp anyway), finally TEMP in case it's Windows.

Cost of rename, delete or change path a file

What is the cost of the delete, rename, and move file operations? Which one is the fastest?
I want to use java and the files are maintained by the linux operating system.

It is not possible to say which is faster in general, because the relative performance depends on a variety of factors. And it is probably irrelevant ... because they do different things and typically are not interchangeable.
However:
Rename and move are typically equivalent if the source and destination locations are in the same file system.
If move involves moving between file systems it is probably the most expensive. O(N) bytes must be copied.
Otherwise, delete is probably the most expensive. The OS needs to update the parent directory AND mark all of the disc blocks used by the file as free.
The actual costs also depend on the operating systems and the type of file system(s) involved, and (in some cases) on the size of the files involved - see above.

It is dependent on the implementation details of the file system. In most fileSystems it should be an order one, O(1), operation.

Renaming a file is basically just changing the path in a localized way, so it should be as fast as changing the path. Deleting really just means deleting a reference, so it should be fairly fast as well.
The only case where you should see a significant increase in operation cost is for copying the file or for changing the path to an other partition/disk. These cases would actually require the file system to copy the file block by block.
How long it actually takes will heavily depend on the file system you are using (ext3, ext4, FAT, ...) and of course on the speed of your hard disks and hard disk connections (i.e. your motherboard).
If you need a definitive answer on your question I don't think you could avoid benchmarking it yourself using your specific test setup.

Handle a great number of files

I have a external disk with a billion files. If I mount the external disk in computer A, my program will scan all files' path and save the files' path in a database table. After that, when I eject the external disk, those data will still remain in the table. The problem is, if some files are deleted in the computer B, and I mount it to the computer A again, I must synchronize the database table in computer A. However, I don't want to scan all the files again because it takes a lots time and waste a lots memory. Is there any way to update the database table without scanning all files whilst minimizing the memory used?
Besides, in my case, memory limitation is more important than time. Which means I would rather to save more memory than save more time.
I think I can cut the files to a lot of sections and use some specific function (may be SHA1?) to check whether the files in this section are deleted. However, I cannot find out a way to cut the files to the sections. Can anyone help me or give me better ideas?

If you don't have control over the file system on the disk you have no choice but to scan the file names on the entire disk. To list the files that have been deleted you could do something like this:
update files in database: set "seen on this scan" to false
for each file on disk do:
insert/update database, setting "seen on this scan" to true
done
deleted files = select from files where "seen on this scan" = false
A solution to the db performance problem could be accumulating the file names into a list of some kind and do a bulk insert/update whenever you reach, say, 1000 files.
As for directories with 1 billion files, you just need to replace the code that lists the files with something that wraps the C functions opendir and readdir. If I were you wouldn't worry about it too much for now. No sane person has 1 billion files in one directory because that sort of thing cripples file systems and common OS tools, so the risk is low and the solution is easy.

In theory, you could speed things up by checking "modified" timestamps on directories. If a directory has not been modified, then you don't need to check any of the files in that directory. Unfortunately, you do need to scan possible subdirectories, and finding them involves scanning the directory ... unless you've saved the directory tree structure.
And of course, this is moot it you've got a flat directory containing a billion files.
I imagine that you are assembling all of the filepaths in memory so that you can sort them before querying the database. (And sorting them is a GOOD idea ...) However there is an alternative to sorting in memory:
Write the filepaths to a file.
Use an external sort utility to sort the file into primary key order.
Read the sorted file, and perform batch queries against the database in key order.
(Do you REALLY have a billion files on a disc? That sounds like a bad design for your data store ...)

Do you have a list of what's deleted when the delete happens(or change whatever process deletes to create this)? If so couldn't you have a list of "I've been deleted" with a timestamp, and then pick up items from this list to only synchronize on what's changed? Naturally, you would still want to have some kind of batch job to sync during a slow time on the server, but I think that could reduce the load.
Another option may be, depending on what is changing the code, to have that process just update the databases (if you have multiple nodes) directly when it deletes. This would introduce some coupling into the systems, but would be the most efficient way to do it.
The best ways in my opinion are some variation on the idea of messaging that a delete has occurred(even if that's just a file that you write to some where with a list of recently deleted files), or some kind of direct callback mechanism, either through code or by just adjusting the persistent data store the application uses directly from the delete process.
Even with all this said, you would always need to have some kind of index synchronization or periodic sanity check on the indexes to be sure that everything is matched up correctly.
You could (and I would be shocked if you didn't have to based on the number of files that you have) partition off the file space into folders with, say, 5,000-10,000 files per folder, and then create a simple file that has a hash of the names of all the files in the folder. This would catch deletes, but I still think that a direct callback of some form when the delete occurs is a much better idea. If you have a monolithic folder with all this stuff, creating something to break that into separate folders (we used simple number under the main folder so we could go on ad nauseum) should speed everything up greatly; even if you have to do this for all new files and leave the old files in place as is, at least you could stop the bleeding on the file retrieval.
In my opinion, since you are programmatically controlling an index of the files, you should really have the same program involved somehow (or notified) when changes occur at the time of change to the underlying file system, as opposed to allowing changes to happen and then looking through everything for updates. Naturally, to catch the outliers where this communication breaks down, you should also have synchronization code in there to actually check what is in the file system and update the index periodically (although this could and probably should be batched out of process to the main application).

If memory is important I would go for the operation system facilities.
If you have ext4 I will presume you are on Unix (you can install find on other operation systems like Win). If this is the case you could use the native find command (this would be for the last minute, you can of course remember the last scan time and modify this to whatever you like):
find /directory_path -type f -mtime -1 -print
Of course you won't have the deletes. If a heuristic algorithm works for you then you can create a thread that slowly goes to each file stored in your database (whatever you need to display first then from newer to older) and check it is still online. This won't consume much memory. I reckon you won't be able to show a billion files to the user anyway.

how to find out the size of file and directory in java without creating the object?

First please dont overlook because you might think it as common question, this is not. I know how to find out size of file and directory using file.length and Apache FileUtils.sizeOfDirectory.
My problem is, in my case files and directory size is too big (in hundreds of mb). When I try to find out size using above code (e.g. creating file object) then my program becomes so much resource hungry and slows down the performance.
Is there any way to know the size of file without creating object?
I am using
for files File file1 = new file(fileName); long size = file1.length();
and for directory, File dir1 = new file (dirPath); long size = fileUtils.sizeOfDirectiry(dir1);
I have one parameter which enables size computing. If parameter is false then it goes smoothly. If false then program lags or hangs.. I am calculating size of 4 directory and 2 database files.

File objects are very lightweight. Either there is something wrong with your code, or the problem is not with the file objects but with the HD access necessary for getting the file size. If you do that for a large number of files (say, tens of thousands), then the harddisk will do a lot of seeks, which is pretty much the slowest operation possible on a modern PC (by several orders of magnitude).

A File is just a wrapper for the file path. It doesn't matter how big the file is only its file name.
When you want to get the size of all the files in a directory, the OS needs to read the directory and then lookup each file to get its size. Each access takes about 10 ms (because that's a typical seek time for a hard drive) So if you have 100,000 file it will take you about 17 minutes to get all their sizes.
The only way to speed this up is to get a faster drive. e.g. Solid State Drives have an average seek time of 0.1 ms but it would still take 10 second or more to get the size of 100K files.
BTW: The size of each file doesn't matter because it doesn't actually read the file. Only the file entry which has it s size.
EDIT: For example, if I try to get the sizes of a large directory. It is slow at first but much faster once the data is cached.
$ time du -s /usr
2911000 /usr
real 0m33.532s
user 0m0.880s
sys 0m5.190s
$ time du -s /usr
2911000 /usr
real 0m1.181s
user 0m0.300s
sys 0m0.840s
$ find /usr | wc -l
259934
The reason the look up is so fast the fist time is that the files were all installed at once and most of the information is available continuously on disk. Once the information is in memory, it takes next to no time to read the file information.
Timing FileUtils.sizeOfDirectory("/usr") take under 8.7 seconds. This is relatively slow compared with the time it takes du, but it is processing around 30K files per second.
An alterative might be to run Runtime.exec("du -s "+directory); however, this will only make a few seconds difference at most. Most of the time is likely to be spent waiting for the disk if its not in cache.

We had a similar performance problem with File.listFiles() on directories with large number of files.
Our setup was one folder with 10 subfolders each with 10,000 files.
The folder was on a network share and not on the machine running the test.
We were using a FileFilter to only accept files with known extensions or a directory so we could recourse down the directories.
Profiling revealed that about 70% of the time was spent calling File.isDirectory (which I assume Apache is calling). There were two calls to isDirectory for each file (one in the filter and one in the file processing stage).
File.isDirectory was slow cause it had to hit the network share for each file.
Reversing the order of the check in the filter to check for valid name before valid directory saved a lot of time, but we still needed to call isDirectory for the recursive lookup.
My solution was to implement a version of listFiles in native code, that would return a data structure that contained all the metadata about a file instead of just the filename like File does.
This got rid of the performance problem but added a maintenance problem of having to native code maintained by Java developers (lucking we only supported one OS).

I think that you need to read the Meta-Data of a file.
Read this tutorial for more information. This might be the solution you are looking for:
http://download.oracle.com/javase/tutorial/essential/io/fileAttr.html

Answering my own question..
This is not the best solution but works in my case..
I have created a batch script to get the size of the directory and then read it in java program. It gives me less execution time when number of files in directory are more then 1L (That is always in my case).. sizeOfDirectory takes around 30255 ms and with batch script i get 1700 ms.. For less number of files batch script is costly.

I'll add to what Peter Lawrey answered and add that when a directory has a lot of files inside it (directly, not in sub dirs) - the time it takes for file.listFiles() it extremely slow (I don't have exact numbers, I know it from experience). The amount of files has to be large, several thousands if I remember correctly - if this is your case, what fileUtils will do is actually try to load all of their names at once into memory - which can be consuming.
If that is your situation - I would suggest restructuring the directory to have some sort of hierarchy that will ensure a small number of files in each sub-directory.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.