Delete files / folders with Java 7 - java

I am getting killed on performance with file / folder deletes in Java.
The code is quite old and I am wondering if Java 7 (which I have upgraded to) actually offers performance improvements, or just new syntax. (I don't want to retool everything unless there is a benefit.) I regularly need to extract large ZIPs and then delete the contents, and the recursive delete time is brutal.
I am also stuck on Windows.
Thanks

I would suggest using one of the jars already provided by the community, for example commons-io-x.x.jar or spring-core.jar.
E.g., org.apache.commons.io.FileUtils:
FileUtils.copyDirectory(from, to);
FileUtils.deleteDirectory(childDir);
FileUtils.forceDelete(springConfigDir);
FileUtils.writeByteArrayToFile(file, data);
Or org.springframework.util.FileSystemUtils:
FileSystemUtils.copyRecursively(from, to);
FileSystemUtils.deleteRecursively(dir);
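If the underlying question is whether Java 7 itself adds anything: the NIO.2 API introduced in Java 7 (java.nio.file) lets you delete a whole tree with Files.walkFileTree instead of hand-rolled recursion. A minimal sketch, with the class name RecursiveDelete made up for illustration; as the next answer notes, the disk rather than the API is usually the real limit:

import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

public final class RecursiveDelete {
    public static void deleteTree(Path root) throws IOException {
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                Files.delete(file);                 // delete each file as it is visited
                return FileVisitResult.CONTINUE;
            }
            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
                if (exc != null) throw exc;         // stop if the walk failed inside this directory
                Files.delete(dir);                  // directory is empty by now, so it can go
                return FileVisitResult.CONTINUE;
            }
        });
    }
}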

File IO is very dependent on the performance of your hardware. Many HDDs can perform 80 - 120 IOPS (I/O operations per second), so if opening a file costs one operation you can read at most about 120 files per second, and since deleting a file can require two updates you may only delete about 60 files per second. With these constraints there is almost nothing you can do in software which will make any difference.
If you have an SSD, however, these can do 80,000 to 230,000 IOPS (a more-than-thousand-fold increase). At that point what you do in software might make a difference, but as you are dealing with compressed files, it is most likely that the CPU will be your bottleneck.

Related

Clojure writing files slowly

We have a Clojure application that takes a dataset (~3000 rows) and writes it to a local file using spit. It works great on the machine where it was written, but on every other machine that pulls down the git code, the write step is agonizingly slow. The process takes seconds on the original machine, but takes upwards of ten minutes on every other machine.
The two primary machines in question (the developer’s machine and mine) are both Manjaro Arch Linux systems with comparable specs and configurations. We are both pulling from the same Git source, and both pulling the same data.
We have confirmed that the code still runs on my machine, since it completes if I try to write only the first ten lines of the dataset (even that still takes almost a minute).
CPU and RAM are barely touched during the process on both machines and the output filesize is less than a MB.
We get the same problem if we use the java.io library with clojure.data.csv or dk.ative.docjure.spreadsheet instead of spit.
The abstracted datashape is:
[["Name" "Price"]
["Foo Widget" 100]
["Bar Widget" 200]]
(but of course with more than 3000 rows)
Any help is appreciated!
Alright, so while we were working on the code example to share, we received some suggestions from another source that have resolved the issue.
We changed the reader to use an input stream instead of reading in the whole file
We wrapped the writer in a doall, as was also suggested by @Reut Sharabani
The underlying issue seems to be how each machine was handling laziness
Thanks to everyone that responded!

Processing large number of text files in java

I am working on an application which has to read and process ~29K files (~500GB) every day. The files will be in zipped format and available on an FTP server.
What I have done: I download the files from FTP, unzip them and process them using multi-threading, which has reduced the processing time significantly (when the number of active threads is kept small). I've written some code and tested it on ~3.5K files (~32GB). Details here: https://stackoverflow.com/a/32247100/3737258
However, the estimated processing time, for ~29K files, still seems to be very high.
What I am looking for: Any suggestion/solution which could help me bring the processing time of ~29K files, ~500GB, to 3-4 hours.
Please note that each file has to be read line by line and each line has to be written to a new file with some modification (some information removed and some new information added).
You should profile your application and see where the current bottleneck is, and fix that. Proceed until you are at your desired speed or cannot optimize further.
For example:
Maybe you unzip to disk. This is slow; do it in memory instead.
Maybe there is a load of garbage collection. See if you can re-use objects and buffers.
Maybe the network is the bottleneck, etc.
You can, for example, use VisualVM.
It's hard to provide you one solution for your issue, since it might be that you simply reached the hardware limit.
Some Ideas:
You can parallelize the processing of the information you read: hand batches of read lines to worker threads (from a pool), each of which processes its batch sequentially (see the sketch after this list)
Use java.nio instead of java.io see: Java NIO FileChannel versus FileOutputstream performance / usefulness
Use a profiler
Instead of the profiler, simply write log messages and measure the duration in multiple parts of your application
Optimize the hardware (use SSD drives, experiment with block size, filesystem, etc.)
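Combining two of the ideas above (unzip in memory rather than to disk, and hand batches of read lines to a small pool), a rough sketch could look like the following. The class name ZipLineProcessor, the modify() placeholder and the pool size of 4 are all made up for illustration and would need tuning against your hardware:

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ZipLineProcessor {
    private static final ExecutorService POOL = Executors.newFixedThreadPool(4); // keep the pool small

    public static void process(InputStream zippedFtpStream, File outDir) throws Exception {
        List<Future<?>> pending = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(zippedFtpStream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // Read the entry entirely in memory instead of unzipping it to disk first.
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(zip, StandardCharsets.UTF_8));
                List<String> lines = new ArrayList<>();
                for (String line; (line = reader.readLine()) != null; ) {
                    lines.add(line);
                }
                File outFile = new File(outDir, entry.getName() + ".out");
                // Hand the whole batch of lines to a worker thread from the pool.
                pending.add(POOL.submit(() -> writeModified(lines, outFile)));
            }
        }
        for (Future<?> f : pending) {
            f.get(); // surface any exception thrown by a worker
        }
    }

    private static Void writeModified(List<String> lines, File outFile) throws IOException {
        try (BufferedWriter out = new BufferedWriter(new FileWriter(outFile))) {
            for (String line : lines) {
                out.write(modify(line));
                out.newLine();
            }
        }
        return null;
    }

    private static String modify(String line) {
        return line; // placeholder for removing/adding the information described in the question
    }
}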
If you are interested in parallel computing then try Apache Spark; it is meant to do exactly what you are looking for.

WatchService performance with many directories

I want to use the Java WatchService to listen for changes on a big number of directories (many hundreds of thousands) but I don't know if it is appropriate for such numbers of watched directories.
Does anyone have experience with WatchService with such numbers of directories?
If it helps, the WatchService will be used on CentOS 6.5 with an EXT4 file system.
Thanks,
Mickael
This situation is fairly common for IDEs. They often use directory watching for complex directory structures with many tens of thousands of files.
There are two things to note:
On Linux you often need to tune the OS to monitor so many files. https://confluence.jetbrains.com/display/IDEADEV/Inotify+Watches+Limit
To prevent this situation it is recommended to increase the watches limit (to, say, 512K). You can do it by adding the following line to the /etc/sysctl.conf file:
fs.inotify.max_user_watches = 524288
This example tunes the system to monitor 512k files.
If you have an HDD, software won't make it spin any faster; it most likely does 80 - 120 IOPS (I/O operations per second), and this is more likely to be a performance bottleneck than you might like.
Like many IO operations in Java, WatchService is a wrapper around a facility which is actually implemented by the OS.
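For reference, watching a whole tree means walking it and registering every directory individually with the WatchService, which is exactly why the inotify watch limit above matters. A minimal sketch (the class name TreeWatcher is made up; error handling omitted):

import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.WatchService;
import java.nio.file.attribute.BasicFileAttributes;
import static java.nio.file.StandardWatchEventKinds.*;

public class TreeWatcher {
    public static WatchService watchTree(Path root) throws IOException {
        final WatchService watcher = FileSystems.getDefault().newWatchService();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) throws IOException {
                // One inotify watch per directory, hence hundreds of thousands of watches here.
                dir.register(watcher, ENTRY_CREATE, ENTRY_DELETE, ENTRY_MODIFY);
                return FileVisitResult.CONTINUE;
            }
        });
        return watcher;
    }
}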

how to find out the size of file and directory in java without creating the object?

First, please don't overlook this because you think it is a common question; it is not. I know how to find out the size of a file and directory using file.length and Apache FileUtils.sizeOfDirectory.
My problem is that in my case the files and directories are very big (hundreds of MB). When I try to find out the size using the above code (i.e. creating File objects), my program becomes very resource hungry and performance slows down.
Is there any way to know the size of file without creating object?
I am using
for files: File file1 = new File(fileName); long size = file1.length();
and for directories: File dir1 = new File(dirPath); long size = FileUtils.sizeOfDirectory(dir1);
I have one parameter which enables size computing. If the parameter is false then the program runs smoothly; if it is true then the program lags or hangs. I am calculating the size of 4 directories and 2 database files.
File objects are very lightweight. Either there is something wrong with your code, or the problem is not with the File objects but with the hard-disk access necessary for getting the file sizes. If you do that for a large number of files (say, tens of thousands), then the hard disk will do a lot of seeks, which is pretty much the slowest operation possible on a modern PC (by several orders of magnitude).
A File is just a wrapper for the file path. It doesn't matter how big the file is; only the file name is held.
When you want to get the size of all the files in a directory, the OS needs to read the directory and then look up each file to get its size. Each access takes about 10 ms (because that's a typical seek time for a hard drive), so if you have 100,000 files it will take you about 17 minutes to get all their sizes.
The only way to speed this up is to get a faster drive, e.g. Solid State Drives have an average seek time of 0.1 ms, but it would still take 10 seconds or more to get the size of 100K files.
BTW: The size of each file doesn't matter because it doesn't actually read the file, only the file entry which has its size.
EDIT: For example, if I try to get the size of a large directory, it is slow at first but much faster once the data is cached.
$ time du -s /usr
2911000 /usr
real 0m33.532s
user 0m0.880s
sys 0m5.190s
$ time du -s /usr
2911000 /usr
real 0m1.181s
user 0m0.300s
sys 0m0.840s
$ find /usr | wc -l
259934
The reason the look-up is so fast the first time is that the files were all installed at once and most of the information is available continuously on disk. Once the information is in memory, it takes next to no time to read the file information.
Timing FileUtils.sizeOfDirectory("/usr") takes under 8.7 seconds. This is relatively slow compared with the time du takes, but it is processing around 30K files per second.
An alternative might be to run Runtime.exec("du -s "+directory); however, this will only make a few seconds difference at most. Most of the time is likely to be spent waiting for the disk if it's not in cache.
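If you do go the du route, a rough sketch with ProcessBuilder might look like this; the class name DuSize is made up, and on a typical GNU/Linux system du -s reports the total in 1K blocks:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class DuSize {
    // Returns the total reported by "du -s" (in 1K blocks on a typical GNU/Linux system).
    public static long sizeKb(String directory) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("du", "-s", directory).start();
        try (BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line = out.readLine();   // e.g. "2911000\t/usr"
            p.waitFor();
            return Long.parseLong(line.split("\\s+")[0]);
        }
    }
}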
We had a similar performance problem with File.listFiles() on directories with large number of files.
Our setup was one folder with 10 subfolders each with 10,000 files.
The folder was on a network share and not on the machine running the test.
We were using a FileFilter to only accept files with known extensions or a directory so we could recurse down the directories.
Profiling revealed that about 70% of the time was spent calling File.isDirectory (which I assume Apache is calling). There were two calls to isDirectory for each file (one in the filter and one in the file processing stage).
File.isDirectory was slow because it had to hit the network share for each file.
Reversing the order of the check in the filter to check for valid name before valid directory saved a lot of time, but we still needed to call isDirectory for the recursive lookup.
My solution was to implement a version of listFiles in native code, that would return a data structure that contained all the metadata about a file instead of just the filename like File does.
This got rid of the performance problem but added a maintenance problem of having native code maintained by Java developers (luckily we only supported one OS).
I think that you need to read the metadata of a file.
Read this tutorial for more information. This might be the solution you are looking for:
http://download.oracle.com/javase/tutorial/essential/io/fileAttr.html
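As a sketch of what that tutorial enables in Java 7: Files.walkFileTree hands the visitor each file's attributes as it walks, so you can sum sizes without a separate length()/isDirectory call per file. The class name DirSize is made up for illustration:

import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.concurrent.atomic.AtomicLong;

public class DirSize {
    public static long sizeOf(Path dir) throws IOException {
        final AtomicLong total = new AtomicLong();
        Files.walkFileTree(dir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                total.addAndGet(attrs.size());   // size comes from the attributes already read by the walk
                return FileVisitResult.CONTINUE;
            }
            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                return FileVisitResult.CONTINUE; // skip entries we cannot read
            }
        });
        return total.get();
    }
}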
Answering my own question..
This is not the best solution but it works in my case.
I have created a batch script to get the size of the directory and then read it in the Java program. It gives me a shorter execution time when the number of files in the directory is more than 1 lakh (100,000), which is always the case for me: sizeOfDirectory takes around 30255 ms, while with the batch script I get 1700 ms. For a smaller number of files the batch script is more costly.
I'll add to what Peter Lawrey answered: when a directory has a lot of files inside it (directly, not in sub-dirs), the time file.listFiles() takes is extremely slow (I don't have exact numbers, I know it from experience). The number of files has to be large, several thousand if I remember correctly. If this is your case, what FileUtils will do is try to load all of their names into memory at once, which can be very memory-consuming.
If that is your situation - I would suggest restructuring the directory to have some sort of hierarchy that will ensure a small number of files in each sub-directory.

Java I/O consumes more CPU resource

I am trying to create 100 files using FileOutputStream/BufferedOutputStream.
I can see the CPU utilization is 100% for 5 to 10 sec. The directory I am writing to is empty. I am creating PDF files through iText. Each file is around 1 MB. I am running on Linux.
How can I rewrite the code so that I can minimize the CPU utilization?
Don't guess: profile your application.
If the numbers show that a lot of time is spent in / within write calls, then look at ways to do faster I/O. But if most time is spent in formatting stuff for output (e.g. iText rendering), then that's where you need to focus your efforts.
Is this in a directory which already contains a lot of files? If so, you may well just be seeing the penalty for having a lot of files in a directory - this varies significantly by operating system and file system.
Otherwise, what are you actually doing while you're creating the files? Where does the data come from? Are they big files? One thing you might want to do is try writing to a ByteArrayOutputStream instead - that way you can see how much of the activity is due to the file system and how much is just how you're obtaining/writing the data.
It's a long shot guess, but even if you're using buffered streams make sure you're not writing out a single byte at a time.
The .read(int) and .write(int) methods are CPU killers. You should be using .read(byte[]...) and .write(byte[], int, int) for certain.
A 1 MB file to write is large enough to use a java.nio FileChannel and see large performance improvements over java.io. Rewrite your code, and measure it against the old stuff. I predict a 2x improvement, at a minimum.
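As an illustration of the two suggestions above (write whole byte arrays rather than single bytes, and try a FileChannel), assuming the PDF has already been rendered to a byte[] in memory; the class and method names are made up:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class BulkWrite {
    // Push the whole rendered document through a few bulk writes instead of write(int) calls.
    public static void save(byte[] pdfBytes, Path target) throws IOException {
        try (FileChannel ch = FileChannel.open(target,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.wrap(pdfBytes);
            while (buf.hasRemaining()) {
                ch.write(buf);   // each write advances the buffer position until everything is flushed
            }
        }
    }
}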
You're unlikely to be able to reduce the CPU load for your task, especially on a Windows system. Java on Linux does support asynchronous file I/O; however, this can seriously complicate your code. I suspect you are running on Windows, as file I/O generally takes much more time on Windows than it does on Linux. I've even heard of improvements by running Java in a Linux VM on Windows.
Take a look at your Task Manager when the process is running, and turn on Show Kernel Times. The CPU time spent in user space can generally be optimized, but the CPU time in kernel space can usually only be reduced by making more efficient calls.
Update -
JSR 203 specifically addresses the need for asynchronous, multiplexed, scatter/gather file IO:
The multiplexed, non-blocking facility introduced by JSR-51 solved much of that problem for network sockets, but it did not do so for filesystem operations.
Until JSR-203 becomes part of Java, you can get true asynchronous IO with the Apache MINA project on Linux.
Java NIO (1) allows you to do channel-based I/O. This is an improvement in performance, but you're only doing a buffer of data at a time, not true async & multiplexed IO.
