WatchService performance with many directories - java

I want to use the Java WatchService to listen for changes across a large number of directories (many hundreds of thousands), but I don't know whether it is appropriate for that many watched directories.
Does anyone have experience with WatchService at that scale?
If it helps, the WatchService will be used on CentOS 6.5 with an ext4 file system.
Thanks,
Mickael

This situation is fairly common for IDEs, which often use directory watching on complex directory structures with many tens of thousands of files.
There are two things to note:
On Linux you often need to tune the OS to monitor so many files. https://confluence.jetbrains.com/display/IDEADEV/Inotify+Watches+Limit
To prevent this situation it is recommended to increase the watches limit (to, say, 512K). You can do it by adding the following line to the /etc/sysctl.conf file:
fs.inotify.max_user_watches = 524288
This example tunes the system to monitor up to 512K files.
If you have an HDD, watching more files won't make it spin any faster; a typical HDD does 80-120 IOPS (I/O operations per second), and that is more likely to be your performance bottleneck than the watching itself.
Like many I/O operations in Java, WatchService is a wrapper around a facility that is actually implemented by the OS (inotify, in the case of Linux).
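Also note that a WatchService registration covers a single directory only, not its subtree, so each of those hundreds of thousands of directories needs its own registration (and consumes one inotify watch against the limit above). Here is a minimal sketch of registering a whole tree with Files.walkFileTree and draining events; class and variable names are illustrative:

import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.HashMap;
import java.util.Map;

import static java.nio.file.StandardWatchEventKinds.*;

public class TreeWatcher {
    public static void main(String[] args) throws IOException, InterruptedException {
        Path root = Paths.get(args[0]);
        WatchService watcher = FileSystems.getDefault().newWatchService();
        final Map<WatchKey, Path> keys = new HashMap<>();

        // Each directory needs its own registration, so walk the whole tree once.
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs)
                    throws IOException {
                keys.put(dir.register(watcher, ENTRY_CREATE, ENTRY_MODIFY, ENTRY_DELETE), dir);
                return FileVisitResult.CONTINUE;
            }
        });

        while (true) {
            WatchKey key = watcher.take();              // blocks until events arrive
            Path dir = keys.get(key);
            for (WatchEvent<?> event : key.pollEvents()) {
                if (event.kind() == OVERFLOW) continue; // real code should handle this
                System.out.println(event.kind() + ": " + dir.resolve((Path) event.context()));
            }
            if (!key.reset()) {                         // directory no longer accessible
                keys.remove(key);
            }
        }
    }
}

The tree registration is a one-time walk; the per-event cost afterwards is low.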

Related

Processing a large number of text files in Java

I am working on an application which has to read and process ~29K files (~500GB) every day. The files arrive in zipped format on an FTP server.
What I have done: I download the files from FTP, unzip them, and process them using multi-threading, which has reduced the processing time significantly (when the number of active threads is fixed to a small number). I've written some code and tested it on ~3.5K files (~32GB). Details here: https://stackoverflow.com/a/32247100/3737258
However, the estimated processing time for ~29K files still seems to be very high.
What I am looking for: Any suggestion/solution which could help me bring the processing time of ~29K files (~500GB) down to 3-4 hours.
Please note that each file has to be read line by line, and each line has to be written to a new file with some modifications (some information removed, some new information added).
You should profile your application and see where the current bottleneck is, and fix that. Proceed until you are at your desired speed or cannot optimize further.
For example:
Maybe you unzip to disk. This is slow; do it in memory instead (see the sketch below).
Maybe there is a load of garbage collection. See if you can re-use objects.
Maybe the network is the bottleneck, etc.
You can, for example, use VisualVM.
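On the "unzip in memory" point, a minimal sketch of streaming entries straight out of the zip without ever writing the decompressed data to disk (process() is a hypothetical stand-in for the real per-line work):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class StreamingUnzip {
    public static void main(String[] args) throws IOException {
        // Read each zip entry as a stream: nothing is ever unzipped to disk.
        try (ZipInputStream zin = new ZipInputStream(Files.newInputStream(Paths.get(args[0])))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                System.out.println("processing " + entry.getName());
                // Do NOT close this reader: that would close the whole zip stream.
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(zin, StandardCharsets.UTF_8));
                String line;
                while ((line = reader.readLine()) != null) {
                    process(line);                      // hypothetical per-line handler
                }
            }
        }
    }

    private static void process(String line) {
        // placeholder for the real transformation
    }
}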
It's hard to provide you with one solution for your issue, since it might be that you have simply reached the hardware limit.
Some Ideas:
You can parallelize the processing of the lines you read: hand batches of lines to threads from a pool, each of which processes its batch sequentially (see the sketch after this list).
Use java.nio instead of java.io, see: Java NIO FileChannel versus FileOutputstream performance / usefulness
Use a profiler
Instead of the profiler, simply write log messages and measure the duration in multiple parts of your application
Optimize the hardware (use SSD drives, experiment with block size, filesystem, etc.)
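As a sketch of the first idea, batching lines from a single reader to a thread pool (BATCH_SIZE and the empty loop body are illustrative, not tuned values):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BatchedLineProcessor {
    private static final int BATCH_SIZE = 10_000;       // illustrative tuning knob

    public static void main(String[] args) throws IOException, InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        try (BufferedReader reader = Files.newBufferedReader(Paths.get(args[0]))) {
            List<String> batch = new ArrayList<>(BATCH_SIZE);
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                if (batch.size() == BATCH_SIZE) {
                    submit(pool, batch);                // hand a full batch to a worker
                    batch = new ArrayList<>(BATCH_SIZE);
                }
            }
            if (!batch.isEmpty()) {
                submit(pool, batch);                    // flush the final partial batch
            }
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private static void submit(ExecutorService pool, List<String> batch) {
        pool.submit(() -> {
            for (String line : batch) {
                // placeholder for the real per-line transformation
            }
        });
    }
}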
If you are interested in parallel computing, then please try Apache Spark; it is meant to do exactly what you are looking for.

Creating a large temporary file in a platform-agnostic way

What's the best way of creating a large temporary file in Java, and being sure that it's on disk, not in RAM somewhere?
If I use
Path tempFile = Files.createTempFile("temp-file-name", ".tmp");
then it works fine for small files, but on my Linux machine, it ends up being stored in /tmp. On many Linux boxes, that's a tmpfs filesystem, backed by RAM, which will cause trouble if the file is large. The appropriate way of doing this on such a box is to put it in /var/tmp, but hard-coding that path doesn't seem very cross-platform to me.
Is there a good cross-platform way of creating a temporary file in Java and being sure that it's backed by disk and not by RAM?
There is no platform-independent way to determine free disk space. Actually, there is not even a good platform-dependent way: you can run into ZFS filesystems (which may be compressing your data on the fly), directories that are being filled by other applications, or network shares that simply lie to you.
I know of these options:
Assume that it is an operating concern. I.e. whoever uses the software should have an administrator who is aware of how much space is left on what device, and who expects to be able to explicitly configure the partition that should hold the data. I'd start considering this at several tens of GB, and prefer it at a few hundred GB.
Assume it's really a temporary file. Document that the application needs xxx GB of temporary space (whatever rough estimate you can give them - my application says "needs ca. 100 GB for every automatic update that you keep on disk").
Abuse the user cache for the file. The XDG standard has $XDG_CACHE_HOME for the cache; the cache directory is supposed to be nice and big (take a look at the ~/.cache/ of anybody using a Linux machine). On Windows, you'd simply use %TEMP% but that's okay because %TEMP% is supposed to be big anyway.
This gives the following strategy: Try environment variables, first XDG_CACHE_HOME (if it's nonempty, it's a Posix system with XDG conventions), then TMP (if it's nonempty, it's a Posix system and you don't have a better option than /tmp anyway), finally TEMP in case it's Windows.
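A minimal sketch of that strategy, falling back to java.io.tmpdir when none of the variables are set (the fallback is my addition, not part of the strategy above):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DiskBackedTempFile {
    // Try the environment variables in order, per the strategy above.
    static Path pickTempDir() {
        for (String var : new String[] {"XDG_CACHE_HOME", "TMP", "TEMP"}) {
            String value = System.getenv(var);
            if (value != null && !value.isEmpty()) {
                return Paths.get(value);
            }
        }
        return Paths.get(System.getProperty("java.io.tmpdir"));
    }

    public static void main(String[] args) throws IOException {
        Path tempFile = Files.createTempFile(pickTempDir(), "temp-file-name", ".tmp");
        System.out.println("Created " + tempFile);
    }
}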

Delete files / folders with Java 7

I am getting killed on performance with file / folder deletes in Java.
The code is quite old and I am wondering if Java 7 (which I upgraded to) actually offers performance improvements, or just another syntax. (I don't want to retool everything unless there is a benefit). I regularly need to extract large ZIPs and then delete the contents and the recursion time is brutal.
I am also stuck on Windows.
Thanks
I would suggest using one of the libraries already provided by the community.
For example, commons-io-x.x.jar or spring-core.jar.
E.g., org.apache.commons.io.FileUtils:
FileUtils.copyDirectory(from, to);
FileUtils.deleteDirectory(childDir);
FileUtils.forceDelete(springConfigDir);
FileUtils.writeByteArrayToFile(file, data);
org.springframework.util.FileSystemUtils;
FileSystemUtils.copyRecursively(from, to);
FileSystemUtils.deleteRecursively(dir);
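To the original question of whether Java 7 offers more than new syntax: NIO.2 gives you a streaming tree walk, which avoids materializing a File[] per directory the way File.listFiles() does. A minimal sketch of a recursive delete with Files.walkFileTree:

import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

public class RecursiveDelete {
    static void deleteTree(Path root) throws IOException {
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                Files.delete(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc)
                    throws IOException {
                if (exc != null) throw exc;
                Files.delete(dir);                      // directory is empty by now
                return FileVisitResult.CONTINUE;
            }
        });
    }
}

Whether it beats the library helpers above depends mostly on the disk, as the next answer explains.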
File I/O is very dependent on the performance of your hardware. Many HDDs can perform only 80-120 IOPS (I/O operations per second). If opening a file costs one operation, you can read at most around 120 files per second; deleting a file can require two metadata updates, which drops you to perhaps 60 deletes per second. With these constraints there is almost nothing you can do in software which will make any difference.
If you have an SSD, however, these can do 80,000 to 230,000 IOPS (more than a thousand-fold increase). At that point what you do in software might make a difference, but as you are dealing with compressed files, it is most likely that the CPU will be your bottleneck.

Linux (Ubuntu 12.04) getting horrible multithreaded performance?

I have a multithreaded file converter that I'm working on. On Windows, it puts each file that's being converted in its own thread and uses 100% CPU (on all cores) all the time. It's awesome! On Ubuntu, I get 100% on the first core and ~10% on all the rest. The performance is poor and disappointing.
I'm using Threads, all within a SwingWorker, so I don't freeze the GUI. I use thread.join on all threads so I can perform a certain task when all threads are complete. I have not changed the code between OSes. Is there a feasible way to fix this?
It was very dumb and I don't quite understand why, but shortly after I posted this, I transferred all of my files to my Ubuntu partition, and it's just as fast (if not faster) than on Windows. Not sure why moving the files would make it go faster? Perhaps my real issue was that, since they were on different file systems, my bottleneck was I/O. Converting just one file from the NTFS partition took 3x longer than when I moved it to the ext4 partition first. (And yes, these are all on the same SSD.)

Java I/O consumes more CPU resource

I am trying to create 100 files using FileOutputStream/BufferedOutputStream.
I can see the CPU utilization is 100% for 5 to 10 seconds. The directory I am writing to is empty. I am creating the PDF files through iText, and each file is around 1 MB. I am running on Linux.
How can I rewrite the code so that I can minimize the CPU utilization?
Don't guess: profile your application.
If the numbers show that a lot of time is spent in / within write calls, then look at ways to do faster I/O. But if most time is spent in formatting stuff for output (e.g. iText rendering), then that's where you need to focus your efforts.
Is this in a directory which already contains a lot of files? If so, you may well just be seeing the penalty for having a lot of files in a directory - this varies significantly by operating system and file system.
Otherwise, what are you actually doing while you're creating the files? Where does the data come from? Are they big files? One thing you might want to do is try writing to a ByteArrayOutputStream instead - that way you can see how much of the activity is due to the file system and how much is just how you're obtaining/writing the data.
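A minimal sketch of that diagnostic, timing the same work against an in-memory sink (renderPdf here is a hypothetical stand-in for the iText step):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class SinkComparison {
    // Hypothetical generator standing in for the iText rendering step.
    static void renderPdf(OutputStream out) throws IOException {
        out.write(new byte[1024 * 1024]);               // ~1 MB dummy payload
    }

    public static void main(String[] args) throws IOException {
        long start = System.nanoTime();
        for (int i = 0; i < 100; i++) {
            ByteArrayOutputStream sink = new ByteArrayOutputStream(1024 * 1024);
            renderPdf(sink);                            // no file system involved
        }
        System.out.printf("100 in-memory writes took %d ms%n",
                (System.nanoTime() - start) / 1_000_000);
    }
}

If this version is still slow, the time is going into producing the bytes, not into the file system.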
It's a long shot guess, but even if you're using buffered streams, make sure you're not writing out a single byte at a time.
The single-byte read() and write(int) methods are CPU killers. You should be using read(byte[], int, int) and write(byte[], int, int) for certain.
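For illustration, a block copy loop of the kind meant here (the 8 KB buffer size is a common default, not a tuned value):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class BlockCopy {
    // Copy in 8 KB blocks instead of one byte at a time: each read()/write(int)
    // call has fixed overhead, so per-byte loops burn CPU on call dispatch.
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }
}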
A 1 MB file is large enough to use a java.nio FileChannel and see large performance improvements over java.io. Rewrite your code and measure it against the old version. I predict a 2x improvement, at a minimum.
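A minimal sketch of such a rewrite with FileChannel (the open options are one reasonable choice; whether you actually see 2x depends on your workload, so measure):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ChannelWrite {
    // Write a whole ~1 MB document through a channel.
    static void writePdf(String path, byte[] document) throws IOException {
        try (FileChannel channel = FileChannel.open(Paths.get(path),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.wrap(document);
            while (buf.hasRemaining()) {
                channel.write(buf);                     // may take several calls
            }
        }
    }
}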
You're unlikely to be able to reduce the CPU load for your task, especially on a Windows system. Java on Linux does support asynchronous file I/O; however, this can seriously complicate your code. I suspect you are running on Windows, as file I/O generally takes much more time on Windows than it does on Linux. I've even heard of improvements from running Java in a Linux VM on Windows.
Take a look at your Task Manager while the process is running, and turn on Show Kernel Times. The CPU time spent in user space can generally be optimized, but the CPU time in kernel space can usually only be reduced by making more efficient calls.
Update -
JSR 203 specifically addresses the need for asynchronous, multiplexed, scatter/gather file IO:
The multiplexed, non-blocking facility introduced by JSR-51 solved much of that problem for network sockets, but it did not do so for filesystem operations.
Until JSR-203 becomes part of Java, you can get true asynchronous IO with the Apache MINA project on Linux.
Java NIO (1) allows you to do channel-based I/O. This is an improvement in performance, but you're only handling a buffer of data at a time, not true async and multiplexed IO.
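For the record, JSR 203 did eventually ship as NIO.2 in Java 7, including AsynchronousFileChannel. A minimal sketch against that API (so it post-dates this answer):

import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Future;

public class AsyncWrite {
    public static void main(String[] args) throws Exception {
        try (AsynchronousFileChannel channel = AsynchronousFileChannel.open(
                Paths.get("out.bin"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ByteBuffer buf = ByteBuffer.wrap("hello".getBytes());
            Future<Integer> result = channel.write(buf, 0);   // write at position 0
            // The calling thread is free to do other work here.
            System.out.println("wrote " + result.get() + " bytes");
        }
    }
}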
