Creating a large temporary file in a platform-agnostic way - java

What's the best way of creating a large temporary file in Java, and being sure that it's on disk, not in RAM somewhere?
If I use
Path tempFile = Files.createTempFile("temp-file-name", ".tmp");
then it works fine for small files, but on my Linux machine, it ends up being stored in /tmp. On many Linux boxes, that's a tmpfs filesystem, backed by RAM, which will cause trouble if the file is large. The appropriate way of doing this on such a box is to put it in /var/tmp, but hard-coding that path doesn't seem very cross-platform to me.
Is there a good cross-platform way of creating a temporary file in Java and being sure that it's backed by disk and not by RAM?

There is no platform-independent way to determine free disk space. Actually, there isn't even a good platform-dependent way: you may be dealing with a ZFS filesystem (which may be compressing your data on the fly), with directories that other applications are filling at the same time, or with network shares that are simply lying to you.
I know of these options:
Assume that it is an operations concern. That is, whoever runs the software should have an administrator who knows how much space is left on which device, and who expects to be able to explicitly configure the partition that should hold the data. I'd start considering this at several tens of GB, and prefer it at a few hundred GB.
Assume it's really a temporary file. Document that the application needs xxx GB of temporary space (whatever rough estimate you can give them - my application says "needs ca. 100 GB for every automatic update that you keep on disk").
Abuse the user cache for the file. The XDG standard has $XDG_CACHE_HOME for the cache; the cache directory is supposed to be nice and big (take a look at the ~/.cache/ of anybody using a Linux machine). On Windows, you'd simply use %TEMP% but that's okay because %TEMP% is supposed to be big anyway.
This gives the following strategy: try environment variables, first XDG_CACHE_HOME (if it's nonempty, you're on a POSIX system with XDG conventions), then TMP (if it's nonempty, you're on a POSIX system and don't have a better option than /tmp anyway), and finally TEMP in case it's Windows.
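A minimal sketch of that lookup order, assuming the directory named by each variable exists and is writable (the writability check and the final fall-back to java.io.tmpdir are additions for illustration, not part of the answer above):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class BigTempFile {
    // Picks a base directory for large temp files: XDG_CACHE_HOME, then TMP, then TEMP,
    // falling back to java.io.tmpdir if none of them is set and usable.
    static Path pickBaseDir() {
        for (String var : new String[] { "XDG_CACHE_HOME", "TMP", "TEMP" }) {
            String value = System.getenv(var);
            if (value != null && !value.isEmpty()) {
                Path candidate = Paths.get(value);
                if (Files.isDirectory(candidate) && Files.isWritable(candidate)) {
                    return candidate;
                }
            }
        }
        return Paths.get(System.getProperty("java.io.tmpdir"));
    }

    public static void main(String[] args) throws IOException {
        Path tempFile = Files.createTempFile(pickBaseDir(), "temp-file-name", ".tmp");
        System.out.println("Created " + tempFile);
    }
}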

Related

Why does Files.deleteIfExists take so long for large files?

On a large file (here 35GB):
Files.deleteIfExists(Paths.get("large.csv"));
Deleting it with Java takes more than 60 seconds; deleting it with rm large.csv on the console takes just a moment.
Why? Can I speed up large-file deletion from within Java?
I would blame this on the operating system. On both Windows and Linux, Java simply calls a function in the native, OS-provided runtime libraries to delete the file.
(Check the OpenJDK source code.)
So why might it take a long time for the operating system to delete a large file?
A typical file system keeps a map of the disk blocks that are free versus in-use. If you are freeing a really large file, a large number of blocks are being freed, so a large number of bits in the free map need to be updated and written to disk.
A typical file system uses a tree-based index structure to map file offsets to disk blocks. If a file is large enough, the index structure may span multiple disk blocks. When a file is deleted, the entire index needs to be scanned to figure out all of the blocks containing data that need to be freed.
These costs are magnified if the file is badly fragmented, and the index blocks and free map blocks are widely scattered.
Deleting a file is typically done synchronously. At least, all of the disk blocks are marked as free before the syscall returns. (If you don't do that, the user is liable to complain that deleting files doesn't work.)
In short, when you delete a huge file, there is a lot of "disk" I/O to do. The operating system does this, not Java.
So why would deleting a file be faster from the command line?
One possible reason is that the rm command you are using is actually just moving the deleted file to a Trash folder. That is really a rename operation, and it is much faster than a true delete.
Note: that's not the normal behavior of rm on Linux.
Another possible reason (on Linux) is that the index and free-map blocks for the file you were deleting were in the buffer cache in one test scenario and not in the other. (If your machine has lots of spare RAM, Linux will cache disk blocks in RAM to improve performance. It is pretty effective.)
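If the minute-long pause in the calling thread is what actually hurts, one workaround (a suggestion building on the rename observation above, not something from this answer) is to rename the file first, which is cheap within one file system, and do the slow delete on a background thread:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.concurrent.CompletableFuture;

public class DeferredDelete {
    // Renames the file (fast: just a directory-entry change) and frees the blocks asynchronously.
    static void deleteLater(Path file) throws IOException {
        Path doomed = file.resolveSibling(file.getFileName() + ".deleting");
        Files.move(file, doomed, StandardCopyOption.ATOMIC_MOVE);
        CompletableFuture.runAsync(() -> {
            try {
                Files.deleteIfExists(doomed);   // the slow block-freeing work happens here
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
    }
}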

Java file copy from mounted network drive slower than OS filemanager copy

Here's the scenario:
We have a NAS with datasets that need to be copied to a local disk for faster processing. Datasets range from 2 to 15 GB, and each dataset sits in its own folder on the NAS.
To copy to the local disk, I call:
FileUtils.copyDirectory(nasDir, localDiskDir);
Where the two parameters are File instances.
The nasDir is a network-mapped SMB drive. When using Java to copy the dataset, the max transfer speed tops at about 8MB/s. The same copy using Windows Explorer or Nautilus, depending on the server, reaches up to 34-35MB/s sustained.
Does anyone have an idea why that is and, cherry on the cake, how to copy a directory through Java faster? Even being 5-10% slower than native would be acceptable; the current difference, though, indicates a significant performance degradation somewhere.
EDIT: I initially thought it might be related to the Apache Commons IO library, but testing with https://docs.oracle.com/javase/tutorial/essential/io/examples/Copy.java shows it to be a more fundamental problem at some level.
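For what it's worth, a bare-bones copy loop with an explicitly sized buffer is a useful baseline to benchmark against FileUtils.copyDirectory. The 1 MiB buffer here is an arbitrary choice for illustration, not something suggested in the question, but undersized buffers are a common culprit in slow SMB transfers:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class BufferedCopy {
    // Copies a single file with a 1 MiB buffer so transfer speed can be
    // compared directly against FileUtils.copyDirectory and the OS file manager.
    static void copyFile(Path source, Path target) throws IOException {
        byte[] buffer = new byte[1024 * 1024];
        try (InputStream in = Files.newInputStream(source);
             OutputStream out = Files.newOutputStream(target)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
    }
}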

Cost of renaming, deleting, or changing the path of a file

What is the cost of the delete, rename, and move file operations? Which one is the fastest?
I want to use Java; the files are managed by a Linux operating system.
It is not possible to say which is faster in general, because the relative performance depends on a variety of factors. And it is probably irrelevant ... because they do different things and typically are not interchangeable.
However:
Rename and move are typically equivalent if the source and destination locations are in the same file system.
If move involves moving between file systems it is probably the most expensive. O(N) bytes must be copied.
Otherwise, delete is probably the most expensive. The OS needs to update the parent directory AND mark all of the disc blocks used by the file as free.
The actual costs also depend on the operating systems and the type of file system(s) involved, and (in some cases) on the size of the files involved - see above.
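To make the rename/move distinction concrete, here is a small sketch with java.nio.file.Files.move (the paths are made up for illustration): within one file system an ATOMIC_MOVE is effectively a rename of the directory entry, while a move across file systems falls back to copying all the bytes and then deleting the source.

import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class MoveCost {
    public static void main(String[] args) throws IOException {
        Path source = Paths.get("/data/large.csv");           // hypothetical paths
        Path sameFs = Paths.get("/data/archive/large.csv");
        Path otherFs = Paths.get("/mnt/backup/large.csv");

        // Same file system: this is just a rename of the directory entry.
        try {
            Files.move(source, sameFs, StandardCopyOption.ATOMIC_MOVE);
        } catch (AtomicMoveNotSupportedException e) {
            Files.move(source, sameFs, StandardCopyOption.REPLACE_EXISTING);
        }

        // Different file system: ATOMIC_MOVE is not supported, so Files.move
        // copies all N bytes and then deletes the source, i.e. O(N).
        Files.move(sameFs, otherFs, StandardCopyOption.REPLACE_EXISTING);
    }
}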
It depends on the implementation details of the file system. In most file systems it should be an order-one, O(1), operation.
Renaming a file is basically just changing the path in a localized way, so it should be as fast as changing the path. Deleting really just means removing a reference, so it should be fairly fast as well.
The only cases where you should see a significant increase in cost are copying the file or moving it to another partition/disk. Those cases actually require the file system to copy the file block by block.
How long it actually takes will heavily depend on the file system you are using (ext3, ext4, FAT, ...) and of course on the speed of your hard disks and hard disk connections (i.e. your motherboard).
If you need a definitive answer to your question, I don't think you can avoid benchmarking it yourself with your specific setup.

How to find the size of a file or directory in Java without creating a File object?

Please don't skip this because it looks like a common question; it isn't. I know how to find the size of a file or directory using File.length() and Apache Commons IO's FileUtils.sizeOfDirectory().
My problem is that in my case the files and directories are large (hundreds of MB). When I try to find the size using the code above (i.e. by creating File objects), my program becomes very resource hungry and performance suffers.
Is there any way to know the size of a file without creating an object?
I am using
for files: File file1 = new File(fileName); long size = file1.length();
and for a directory: File dir1 = new File(dirPath); long size = FileUtils.sizeOfDirectory(dir1);
I have one parameter which enables size computation. If it is false, everything runs smoothly; if it is true, the program lags or hangs. I am calculating the size of 4 directories and 2 database files.
File objects are very lightweight. Either there is something wrong with your code, or the problem is not the File objects but the disk access needed to get the file sizes. If you do that for a large number of files (say, tens of thousands), the hard disk will do a lot of seeks, which is pretty much the slowest operation possible on a modern PC (by several orders of magnitude).
A File is just a wrapper for the file path; it doesn't matter how big the file is, only its file name.
When you want to get the size of all the files in a directory, the OS needs to read the directory and then look up each file to get its size. Each access takes about 10 ms (a typical seek time for a hard drive), so if you have 100,000 files it will take about 17 minutes to get all their sizes.
The only way to speed this up is to get a faster drive. For example, solid-state drives have an average seek time of about 0.1 ms, but it would still take 10 seconds or more to get the size of 100K files.
BTW: The size of each file doesn't matter, because the file itself isn't read - only its directory entry, which holds the size.
EDIT: For example, if I try to get the sizes of a large directory, it is slow at first but much faster once the data is cached.
$ time du -s /usr
2911000 /usr
real 0m33.532s
user 0m0.880s
sys 0m5.190s
$ time du -s /usr
2911000 /usr
real 0m1.181s
user 0m0.300s
sys 0m0.840s
$ find /usr | wc -l
259934
The reason the lookup is so fast the first time is that the files were all installed at once and most of the information is laid out contiguously on disk. Once the information is in memory, it takes next to no time to read the file information.
Timing FileUtils.sizeOfDirectory("/usr") takes under 8.7 seconds. This is relatively slow compared with the time du takes, but it is still processing around 30K files per second.
An alternative might be to run Runtime.exec("du -s " + directory); however, this will only make a few seconds' difference at most. Most of the time is likely to be spent waiting for the disk if the data is not in cache.
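A rough sketch of that alternative, using ProcessBuilder and assuming a Unix-like system with du on the PATH (the size comes back in du's block units, 1K blocks by default on Linux):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class DuSize {
    // Returns the size reported by `du -s`, in du's block units (1K blocks by default on Linux).
    static long sizeOf(String directory) throws IOException, InterruptedException {
        Process process = new ProcessBuilder("du", "-s", directory).start();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line = reader.readLine();        // e.g. "2911000\t/usr"
            process.waitFor();
            return Long.parseLong(line.split("\\s+")[0]);
        }
    }
}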
We had a similar performance problem with File.listFiles() on directories with large number of files.
Our setup was one folder with 10 subfolders each with 10,000 files.
The folder was on a network share and not on the machine running the test.
We were using a FileFilter to accept only files with known extensions or directories, so we could recurse down the directories.
Profiling revealed that about 70% of the time was spent calling File.isDirectory (which I assume Apache is calling). There were two calls to isDirectory for each file (one in the filter and one in the file processing stage).
File.isDirectory was slow because it had to hit the network share for each file.
Reversing the order of the check in the filter to check for valid name before valid directory saved a lot of time, but we still needed to call isDirectory for the recursive lookup.
My solution was to implement a version of listFiles in native code that returns a data structure containing all the metadata about a file, instead of just the filename as File does.
This got rid of the performance problem but added a maintenance problem: native code has to be maintained by Java developers (luckily we only supported one OS).
I think you need to read the metadata of the files.
Read this tutorial for more information. This might be the solution you are looking for:
http://download.oracle.com/javase/tutorial/essential/io/fileAttr.html
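Building on that tutorial, here is a minimal sketch (an illustration, not code from the thread) that sums a directory's size with Files.walkFileTree, using the attributes handed to the visitor so each file is stat-ed only once during the walk:

import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.concurrent.atomic.AtomicLong;

public class DirectorySize {
    static long sizeOf(Path dir) throws IOException {
        AtomicLong total = new AtomicLong();
        Files.walkFileTree(dir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                total.addAndGet(attrs.size());   // size comes from the attributes, no extra stat
                return FileVisitResult.CONTINUE;
            }
        });
        return total.get();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(sizeOf(Paths.get("/usr")) + " bytes");
    }
}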
Answering my own question:
This is not the best solution, but it works in my case.
I have created a batch script to get the size of the directory and then read it in the Java program. It gives me a lower execution time when the number of files in the directory is more than 1L (100,000), which is always the case for me. sizeOfDirectory takes around 30,255 ms, and with the batch script I get 1,700 ms. For smaller numbers of files the batch script is costlier.
I'll add to what Peter Lawrey answered: when a directory has a lot of files directly inside it (not in subdirectories), the time File.listFiles() takes is extremely slow (I don't have exact numbers; I know it from experience). The number of files has to be large, several thousand if I remember correctly. If this is your case, what FileUtils will actually do is try to load all of their names into memory at once, which can be very memory-consuming.
If that is your situation, I would suggest restructuring the directory into some sort of hierarchy that ensures a small number of files in each subdirectory.

Solaris: virtual slices/disks for use with ZFS

This is a little related to my previous question, Solaris: Mounting a file system on an application's handlers, except this question is for a different purpose and is simpler: there is no open/close/lock, just a fixed-length block of bytes with read/write operations.
Is there any way I can create a virtual slice, kind of like a RAM disk or an SVM slice, but with the reads and writes going through my app?
I am planning to use ZFS to take multiple of these virtual slices/disks and make them into one larger one for distributed backup storage with snapshots. I really like the compression and stacking that ZFS offers. If necessary I can guarantee that there is only one instance of ZFS accessing these virtual disks at a time (to prevent cache conflicts and such). If the one instance goes down, we can make sure it won't start back up and then we can start another instance of that ZFS.
I am planning to have those disks in chunks of about 4 GB or so; then I can move each chunk around and decide where to store them (mirrored multiple times, of course), and then have ZFS access the chunks and assemble them into larger chunks for actual use. ZFS would also permit adding more of these small chunks, if necessary, to increase the size of the larger chunk.
I am aware there would be extra latency / network traffic if we used my own app in Java, but this is just for backup storage. The production storage is an entirely different configuration that does not relate.
Edit: We have a system that uses all the available space; when there is not enough space, it removes old snapshots and increases the gaps between the remaining old snapshots. The purpose of my proposal is to allow the unused space on production equipment to be put to use at no extra cost. At different times, different units of our production equipment will have free space. The system I am describing should also eliminate any single point of failure when accessing data. I am hoping not to have to buy two large units and keep them synchronized; I would prefer to just have two access points and then mix large and small units any way we want, moving data around seamlessly.
This is a cross-post because it is more software-related than sysadmin-related. The original question is here: https://serverfault.com/questions/212072. It may be a good idea for the original to be closed.
One way would be to write a Solaris device driver, more precisely a block device driver that emulates a real disk but communicates back to your application instead.
Start with reading the Device Driver Tutorial, then have a look at OpenSolaris source code for real drivers code.
Alternatively, you might investigate modifying Solaris iSCSI target to be the interface with your application. Again, looking at OpenSolaris COMSTAR will be a good start.
It seems that any fixed-length file on any file system will do as a block device for use with ZFS. I am not sure how reboots work, but I am sure we can write some boot-up commands to sort that out.
Edit: The fixed length file would be on a network file system such as NFS.
