Java: Watching a directory to move large files

I have been writing a program that watches a directory and, when files are created in it, renames them and moves them to a new directory. In my first implementation I used Java's WatchService API, which worked fine when I was testing with 1 KB files. The problem is that in reality the files being created are anywhere from 50 to 300 MB. With files that large, the watcher API would find the file right away but could not move it because it was still being written. I tried putting the watcher in a loop (which generated exceptions until the file could be moved), but this seemed pretty inefficient.
Since that didn't work, I tried using a timer that checks the folder every 10 s and moves files when it can. This is the method I ended up going with.
Question: Is there any way to be signaled when a file is done being written without doing an exception check or continually comparing the size? I like the idea of using the watcher API just once for each file instead of continually checking with a timer (and running into exceptions).
All responses are greatly appreciated!
nt

I ran into the same problem today. In my use case a small delay before the file is actually imported was not a big problem, and I still wanted to use the NIO2 API. The solution I chose was to wait until a file has not been modified for 10 seconds before performing any operations on it.
The important part of the implementation is as follows. The program waits until the wait time expires or a new event occurs. The expiration time is reset every time a file is modified. If a file is deleted before the wait time expires, it is removed from the map. I use the poll method with a timeout equal to the time left until the earliest expiration, that is (lastModified + waitTime) - currentTime.
private final Map<Path, Long> expirationTimes = new HashMap<>();
private final long newFileWait = 10000L;

// Checked exceptions are simply propagated here for brevity.
public void run() throws IOException, InterruptedException {
    for (;;) {
        // Retrieves and removes the next watch key, waiting if none are present.
        WatchKey k = watchService.take();
        for (;;) {
            long currentTime = System.currentTimeMillis();
            if (k != null)
                handleWatchEvents(k);
            handleExpiredWaitTimes(currentTime);
            // If there are no files left, stop polling and block on take()
            if (expirationTimes.isEmpty())
                break;
            long minExpiration = Collections.min(expirationTimes.values());
            long timeout = minExpiration - currentTime;
            logger.debug("timeout: " + timeout);
            k = watchService.poll(timeout, TimeUnit.MILLISECONDS);
        }
    }
}

private void handleExpiredWaitTimes(long currentTime) {
    // Start the import for files whose expiration time has passed.
    // An explicit iterator is used because removing from the map inside a
    // for-each loop can throw ConcurrentModificationException.
    Iterator<Entry<Path, Long>> it = expirationTimes.entrySet().iterator();
    while (it.hasNext()) {
        Entry<Path, Long> entry = it.next();
        if (entry.getValue() <= currentTime) {
            logger.debug("expired " + entry);
            // do something with the file
            it.remove();
        }
    }
}

private void handleWatchEvents(WatchKey k) throws IOException {
    List<WatchEvent<?>> events = k.pollEvents();
    for (WatchEvent<?> event : events) {
        handleWatchEvent(event, keys.get(k));
    }
    // reset the watch key so it can be reported again by the watch service
    k.reset();
}

private void handleWatchEvent(WatchEvent<?> event, Path dir) throws IOException {
    Kind<?> kind = event.kind();
    WatchEvent<Path> ev = cast(event);
    Path name = ev.context();
    Path child = dir.resolve(name);
    if (kind == ENTRY_MODIFY || kind == ENTRY_CREATE) {
        // Update the modified time. The map is keyed by the resolved path so
        // that ENTRY_DELETE below removes the same entry.
        FileTime lastModified = Files.getLastModifiedTime(child, NOFOLLOW_LINKS);
        expirationTimes.put(child, lastModified.toMillis() + newFileWait);
    }
    if (kind == ENTRY_DELETE) {
        expirationTimes.remove(child);
    }
}

Write another file as an indication that the original file is complete.
E.g. while 'fileorg.dat' is still growing, nothing happens; once it is done, create a file 'fileorg.done' and watch
only for 'fileorg.done'.
With clever naming conventions you should not have problems.
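
A minimal sketch of the consumer side of this convention, assuming the writer creates an empty `<name>.done` file only after it has closed `<name>.dat`. The class name `DoneMarker` and both suffixes are illustrative choices, not part of any library.

```java
import java.nio.file.Path;
import java.util.Optional;

/** Hypothetical helper: maps a ".done" marker back to the data file it announces. */
final class DoneMarker {
    private static final String MARKER_SUFFIX = ".done"; // assumed convention
    private static final String DATA_SUFFIX = ".dat";    // assumed convention

    private DoneMarker() {}

    /**
     * Returns the data file announced by the given marker, or empty if the
     * path is not a marker. Intended to be called with the resolved path of
     * each ENTRY_CREATE event; everything that yields empty is ignored.
     */
    static Optional<Path> dataFileFor(Path created) {
        Path fileName = created.getFileName();
        if (fileName == null) {
            return Optional.empty();
        }
        String name = fileName.toString();
        if (!name.endsWith(MARKER_SUFFIX) || name.length() == MARKER_SUFFIX.length()) {
            return Optional.empty();
        }
        String base = name.substring(0, name.length() - MARKER_SUFFIX.length());
        return Optional.of(created.resolveSibling(base + DATA_SUFFIX));
    }
}
```

With this in place the watcher never has to guess whether `fileorg.dat` is complete: it only reacts when `fileorg.done` shows up.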

Two solutions:
The first is a slight variation of the answer by stacker:
Use a unique suffix for incomplete files. Something like myhugefile.zip.inc instead of myhugefile.zip. Rename the files when upload / creation is finished. Exclude .inc files from the watch.
The second is to use a different folder on the same drive to create / upload / write the files and move them to the watched folder once they are ready. Moving should be an atomic action if they are on the same drive (file system dependent, I guess).
Either way, the clients that create the files will have to do some extra work.
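
A rough sketch of what that extra work could look like on the writing side, assuming the writer is also Java and both names live on the same file system. `AtomicPublish` and the `.inc` suffix are hypothetical names used only for illustration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical writer-side helper for the ".inc" naming convention. */
final class AtomicPublish {
    private static final String INCOMPLETE_SUFFIX = ".inc"; // assumed convention

    private AtomicPublish() {}

    /** True for names the watcher should skip. */
    static boolean isIncomplete(Path p) {
        Path name = p.getFileName();
        return name != null && name.toString().endsWith(INCOMPLETE_SUFFIX);
    }

    /** Writes the bytes under a temporary name, then renames in one step. */
    static Path writeThenPublish(Path target, byte[] content) {
        Path temp = target.resolveSibling(target.getFileName() + INCOMPLETE_SUFFIX);
        try {
            try (OutputStream out = Files.newOutputStream(temp)) {
                out.write(content);
            }
            // ATOMIC_MOVE fails with AtomicMoveNotSupportedException rather
            // than silently falling back to a copy across file systems.
            return Files.move(temp, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The watcher then calls `isIncomplete` on each event and acts only on the `ENTRY_CREATE` for the final name. Whether an existing target is replaced by an atomic move is implementation specific, so this sketch assumes the target does not exist yet.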

I know it's an old question but maybe it can help somebody.
I had the same issue, so what I did was the following:
if (kind == ENTRY_CREATE) {
    System.out.println("Creating file: " + child);
    boolean isGrowing;
    do {
        long initialSize = child.toFile().length();
        Thread.sleep(1000);
        long finalSize = child.toFile().length();
        isGrowing = initialSize < finalSize;
    } while (isGrowing);
    System.out.println("Finished creating file!");
}
While the file is being created it keeps getting bigger, so I compare its size at two points one second apart. The app stays in the loop until both sizes are the same.

Looks like Apache Camel handles the file-not-done-uploading problem by trying to rename the file (java.io.File.renameTo). If the rename fails, there is no read lock yet, so it keeps trying. When the rename succeeds, it renames the file back and then proceeds with the intended processing.
See operations.renameFile below. Here are the links to the Apache Camel source: GenericFileRenameExclusiveReadLockStrategy.java and FileUtil.java
public boolean acquireExclusiveReadLock( ... ) throws Exception {
    LOG.trace("Waiting for exclusive read lock to file: {}", file);
    // the trick is to try to rename the file, if we can rename then we have exclusive read
    // since its a Generic file we cannot use java.nio to get a RW lock
    String newName = file.getFileName() + ".camelExclusiveReadLock";
    // make a copy as result and change its file name
    GenericFile<T> newFile = file.copyFrom(file);
    newFile.changeFileName(newName);
    StopWatch watch = new StopWatch();
    boolean exclusive = false;
    while (!exclusive) {
        // timeout check
        if (timeout > 0) {
            long delta = watch.taken();
            if (delta > timeout) {
                CamelLogger.log(LOG, readLockLoggingLevel,
                        "Cannot acquire read lock within " + timeout + " millis. Will skip the file: " + file);
                // we could not get the lock within the timeout period, so return false
                return false;
            }
        }
        exclusive = operations.renameFile(file.getAbsoluteFilePath(), newFile.getAbsoluteFilePath());
        if (exclusive) {
            LOG.trace("Acquired exclusive read lock to file: {}", file);
            // rename it back so we can read it
            operations.renameFile(newFile.getAbsoluteFilePath(), file.getAbsoluteFilePath());
        } else {
            boolean interrupted = sleep();
            if (interrupted) {
                // we were interrupted while sleeping, we are likely being shutdown so return false
                return false;
            }
        }
    }
    return true;
}

While it's not possible to be notified by the WatchService API when the OS finishes copying, all options seem to be workarounds (including this one!).
As commented above:
1) Moving or copying is not an option on UNIX;
2) File.canWrite always returns true if you have permission to write, even if the file is still being copied;
3) Waiting until a timeout expires or a new event occurs would be an option, but what if the system is overloaded and the copy has not finished? And if the timeout is a large value, the program waits too long;
4) Writing another file to 'flag' that the copy has finished is not an option if you are just consuming the file, not creating it.
An alternative is to use the code below:
boolean locked = true;
while (locked) {
    RandomAccessFile raf = null;
    try {
        // Throws FileNotFoundException while the file cannot be opened.
        // 'rw' is not needed: if the file is deleted while copying, the 'w' option would create an empty file.
        raf = new RandomAccessFile(file, "r");
        raf.seek(file.length()); // just to make sure everything was copied, go to the last byte
        locked = false;
    } catch (IOException e) {
        locked = file.exists();
        if (locked) {
            System.out.println("File locked: '" + file.getAbsolutePath() + "'");
            Thread.sleep(1000); // wait some time
        } else {
            System.out.println("File was deleted while copying: '" + file.getAbsolutePath() + "'");
        }
    } finally {
        if (raf != null) {
            raf.close();
        }
    }
}

This is a very interesting discussion, as certainly this is a bread and butter use case: wait for a new file to be created and then react to the file in some fashion. The race condition here is interesting, as certainly the high-level requirement here is to get an event and then actually obtain (at least) a read lock on the file. With large files or just simply lots of file creations, this could require a whole pool of worker threads that just periodically try to get locks on newly created files and, when they're successful, actually do the work. But as I am sure NT realizes, one would have to do this carefully to make it scale as it is ultimately a polling approach, and scalability and polling aren't two words that go together well.

I had to deal with a similar situation when I implemented a file system watcher to transfer uploaded files. The solution I implemented consists of the following:
1- First of all, maintain a Map of unprocessed files, each with a flag. (As long as the file is still being copied, the file system generates modify events, so you can ignore them while the flag is false.)
2- In your fileProcessor, pick up a file from the map and check whether it's locked by the file system. If it is, you will get an exception; just catch it, put your thread in a wait state (e.g. 10 seconds), and retry until the lock is released. After processing the file, you can either change the flag to true or remove the file from the map.
This solution will not be efficient if many versions of the same file are transferred during the wait time slot.
Cheers,
Ramzi

Depending on how urgently you need to move the file once it is done being written, you can also check for a stable last-modified timestamp and only move the file once it has quiesced. The amount of time you need it to be stable can be implementation dependent, but I would presume that a file whose last-modified timestamp hasn't changed for 15 seconds is stable enough to be moved.
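
As a rough illustration of that check, here is a small sketch. `QuietPeriod` is a made-up class name, and the quiet period (15 seconds above) is passed in as a parameter because the right value depends on the writer.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for the "stable last-modified timestamp" check. */
final class QuietPeriod {
    private QuietPeriod() {}

    /** Pure check: has the timestamp been unchanged for at least quietMillis? */
    static boolean isQuiet(long lastModifiedMillis, long nowMillis, long quietMillis) {
        return nowMillis - lastModifiedMillis >= quietMillis;
    }

    /** Reads the file's last-modified time and applies the check above. */
    static boolean isQuiet(Path file, long quietMillis) throws IOException {
        long lastModified = Files.getLastModifiedTime(file).toMillis();
        return isQuiet(lastModified, System.currentTimeMillis(), quietMillis);
    }
}
```

A scheduled task could call `QuietPeriod.isQuiet(path, 15_000)` for each pending file and move only those that return true. Note that the comparison uses the local clock, so if the writer runs on another machine (for example over a network share), clock skew can distort the result.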

For large files on Linux, the file gets copied with a .filepart extension. You just need to check the extension using the Commons IO API and register for the ENTRY_CREATE event. I tested this with my .csv files (1 GB) and it worked.
public void run()
{
    try
    {
        WatchKey key = myWatcher.take();
        while (key != null)
        {
            for (WatchEvent<?> event : key.pollEvents())
            {
                if (FilenameUtils.isExtension(event.context().toString(), "filepart"))
                {
                    System.out.println("Inside the PartFile " + event.context().toString());
                } else
                {
                    System.out.println("Full file Copied " + event.context().toString());
                    // Do whatever you want to do with this file.
                }
            }
            key.reset();
            key = myWatcher.take();
        }
    } catch (InterruptedException e)
    {
        e.printStackTrace();
    }
}

If you don't have control over the write process, log all ENTRY_CREATE events and observe whether there are patterns.
In my case, the files are created via WebDAV (Apache). A lot of temporary files are created, but two ENTRY_CREATE events are also triggered for the same file. The second ENTRY_CREATE event indicates that the copy process is complete.
Here are my example ENTRY_CREATE events. The absolute file path is printed (your log may differ, depending on the application that writes the file):
[info] application - /var/www/webdav/.davfs.tmp39dee1 was created
[info] application - /var/www/webdav/document.docx was created
[info] application - /var/www/webdav/.davfs.tmp054fe9 was created
[info] application - /var/www/webdav/document.docx was created
[info] application - /var/www/webdav/.DAV/__db.document.docx was created
As you can see, I get two ENTRY_CREATE events for document.docx. After the second event I know the file is complete. Temporary files are obviously ignored in my case.
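
If your own logs show a similar pattern, the bookkeeping could be sketched like this. `CreateCounter` is a hypothetical class, and the threshold of two events is taken from the WebDAV observation above; it is an assumption that has to be confirmed for any other writer.

```java
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical tracker: treats the n-th ENTRY_CREATE for a path as "complete". */
final class CreateCounter {
    private final Map<Path, Integer> seen = new HashMap<>();
    private final int completeAfter;

    CreateCounter(int completeAfter) {
        this.completeAfter = completeAfter;
    }

    /**
     * Records one ENTRY_CREATE event for the file. Returns true when the
     * configured number of events has been seen, and then forgets the path
     * so a later upload of the same name starts counting from zero.
     */
    boolean recordCreate(Path file) {
        int count = seen.merge(file, 1, Integer::sum);
        if (count >= completeAfter) {
            seen.remove(file);
            return true;
        }
        return false;
    }
}
```

The watcher would skip temporary names first and call `recordCreate` only for the remaining paths.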

So, I had the same problem and the following solution worked for me.
Earlier unsuccessful attempt - Monitoring the "lastModifiedTime" stat of each file, but I noticed that a large file's growth may pause for some time (the size does not change continuously).
Basic idea - For every event, create a trigger file (in a temporary directory) whose name has the following format -
OriginalFileName_lastModifiedTime_numberOfTries
This file is empty; all the information is in the name. The original file is only considered complete after several intervals of a specific duration have passed without a change in its "last modified time" stat. (Note - since it's a file stat, reading it is cheap -> O(1))
NOTE - This trigger file is handled by a different service (say 'FileTrigger').
Advantages -
No sleep or wait to hold up the system.
Frees the file watcher to monitor other events.
CODE for FileWatcher -
val triggerFileName: String = triggerFileTempDir + originalFileName + "_" + Files.getLastModifiedTime(Paths.get(event.getFile.getName.getPath)).toMillis + "_0"
// creates trigger file in temporary directory
val triggerFile: File = new File(triggerFileName)
val isCreated: Boolean = triggerFile.createNewFile()
if (isCreated)
  println("Trigger created: " + triggerFileName)
else
  println("Error in creating trigger file: " + triggerFileName)
CODE for FileTrigger (cron job with an interval of, say, 5 mins) -
val actualPath: String = "Original file directory here"
val tempPath: String = "Trigger file directory here"
val folder: File = new File(tempPath)
val listOfFiles = folder.listFiles()
for (i <- listOfFiles)
{
  // ActualFileName_LastModifiedTime_NumberOfTries
  val triggerFileName: String = i.getName
  val triggerFilePath: String = i.toString
  // extracting file info from trigger file name
  // (note: this split assumes the original file name contains no underscore)
  val fileInfo: Array[String] = triggerFileName.split("_", 3)
  // 0 -> Original file name, 1 -> last modified time, 2 -> number of tries
  val actualFileName: String = fileInfo(0)
  val actualFilePath: String = actualPath + actualFileName
  val modifiedTime: Long = fileInfo(1).toLong
  val numberOfTries: Int = fileInfo(2).toInt
  val currentModifiedTime: Long = Files.getLastModifiedTime(Paths.get(actualFilePath)).toMillis
  val differenceInModifiedTimes: Long = currentModifiedTime - modifiedTime
  // checks if file has been copied completely (4 intervals of 5 mins each with no modification)
  if (differenceInModifiedTimes == 0 && numberOfTries == 3)
  {
    FileUtils.deleteQuietly(new File(triggerFilePath))
    println("Trigger file deleted. Original file completed : " + actualFilePath)
  }
  else
  {
    var newTriggerFileName: String = null
    if (differenceInModifiedTimes == 0)
    {
      // updates numberOfTries by 1
      newTriggerFileName = actualFileName + "_" + modifiedTime + "_" + (numberOfTries + 1)
    }
    else
    {
      // updates modified timestamp and resets numberOfTries to 0
      newTriggerFileName = actualFileName + "_" + currentModifiedTime + "_" + 0
    }
    // renames trigger file
    new File(triggerFilePath).renameTo(new File(tempPath + newTriggerFileName))
    println("Trigger file renamed: " + triggerFileName + " -> " + newTriggerFileName)
  }
}

I speculate that java.io.File.canWrite() will tell you when a file has been done writing.

Related

Keep checking for a text in a file until its found or a certain timeout

I need to watch an error log file for a certain time after starting a process. When either the word is found or the timeout is reached, I need to exit and report whether the text was found before the timeout. I tried the code below but couldn't get it to work:
public void waitFortext(String expectedText,
        String filePath){
    long timeout = 50000 + System.currentTimeMillis();
    File file = new File(filePath);
    String content = FileUtils.readFileToString(file, "UTF-8");
    boolean available = false;
    while (available || System.currentTimeMillis() > timeout) {
        available = content.contains(expectedText);
        Thread.sleep(500);
        if (available) {
            return;
        }
    }
}

Make a variable oldTime and set it to System.nanoTime() when you want the time to start. Make a variable newTime and update it to System.nanoTime() every time the loop runs. Compare the difference of these two values to the amount of time you want, exiting the loop when the difference is greater.

The problem here is that you read the file only once.
Move the line
String content = FileUtils.readFileToString(file, "UTF-8");
inside the loop (make it the first statement).
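
A possible corrected version is sketched below. Besides re-reading the file on every pass, it inverts the loop condition from the question, which as written never enters the loop. It uses only the JDK instead of Commons IO; `LogWaiter` and the parameter names are made up for this sketch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper that polls a log file for a piece of text. */
final class LogWaiter {
    private LogWaiter() {}

    /**
     * Re-reads the file every pollMillis until it contains expectedText or
     * timeoutMillis has elapsed. Returns true if the text was found, false on
     * timeout or if the waiting thread is interrupted.
     */
    static boolean waitForText(Path file, String expectedText, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            try {
                if (Files.exists(file)) {
                    // Read the current content on every pass, not once up front.
                    String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                    if (content.contains(expectedText)) {
                        return true;
                    }
                }
                if (System.nanoTime() - deadline >= 0) {
                    return false;
                }
                Thread.sleep(pollMillis);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

Reading the whole file each time is fine for small logs; for large ones, remembering the last read offset would be cheaper.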

Java: Unexpected exit during write

Let's say I have a Java program writing out a large JSON file, which takes some time. Is there a way to determine on the next startup whether the program exited unexpectedly, so I can tell whether my JSON is corrupted?

Rename the file after your process is complete:
try
{
    File outputFile = ...;
    someLongRunningProcess( outputFile );
    File successfulFile = ...;
    outputFile.renameTo( successfulFile );
}
catch ( Exception ex )
{
    ...
}
If you don't have a successfulFile when you restart, your previous run wasn't successful.
Just keep the renameTo operation within a single file system so it's a simple, almost instantaneous atomic operation instead of any implied copy.

Could you please explain more clearly? Do you mean starting up again after the JVM exits? If that is what you meant, you will need a file-based flag for that.
String expectedFlag = "Good Exit";

void afterExit() {
    // always overwrite whatever is in this file
    // write flag value to a txt file
}

boolean beforeStarting() {
    String flagText = readFromFlagFile();
    return expectedFlag.equals(flagText);
}

Listing of files inside 7-zip archive takes some seconds to complete

I am trying to use the Apache Commons Compress to read the content of a 7-zip file. I'm not interested in reading/extracting the content, I just want to get the list of all the entries.
I made this code, but with 4MB archives it takes 6 seconds to read the whole file.
public static void main(String[] args) throws IOException {
    File sevenz = new File("testfile.7z");
    System.out.println("Reading 7-zip...");
    SevenZFile sevenZFile = new SevenZFile(sevenz);
    long s = System.currentTimeMillis();
    SevenZArchiveEntry entry;
    while ((entry = sevenZFile.getNextEntry()) != null) {
        System.out.print(entry.isDirectory() ? "Dir" : "File");
        System.out.print("\t");
        System.out.print("*********.***"); //entry.getName();
        System.out.print("\t");
        System.out.println(entry.getHasCrc() ? "CRC" : "NO-CRC");
    }
    System.out.println("------------------------------");
    System.out.println("7-zip\t" + (System.currentTimeMillis() - s) + " ms to read.");
}
The output is:
Reading 7-zip...
File *********.*** CRC
File *********.*** CRC
File *********.*** CRC
File *********.*** CRC
File *********.*** CRC
------------------------------
7-zip 6236 ms to read.
Is the file listing process supposed to take all this time or am I doing something wrong?
I also tried to remove all the prints, but the time it takes to read the file is the same.

That does seem a little on the high side. The first thing I would do would be to remove extraneous effort and time only the reading portion.
That means commenting out all the System.out.println commands inside the loop:
while ((entry = sevenZFile.getNextEntry()) != null) {
}
System.out.println("total\t" + (System.currentTimeMillis()-s) + " ms.");
Do that and see if it makes a difference. That will tell you whether it's the entry scanning itself or the printing and/or extraction of the data from each entry.
Beyond that, you can find out how long each iteration takes with:
while ((entry = sevenZFile.getNextEntry()) != null) {
    long s2 = System.currentTimeMillis();
    System.out.println("entry\t" + (s2 - s) + " ms.");
    s = s2;
}
I have a vague recollection that Apache Commons Compress read the entire list of entries on start and that appears to be the case based on the source code here.
One possibility would be to grab that source code, incorporate it as is in your own code temporarily, then profile it to see where it's spending most of the time during instantiation.

Is java.util.logging.FileHandler in Java 8 broken?

First, a simple test code:
package javaapplication23;

import java.io.IOException;
import java.util.logging.FileHandler;

public class JavaApplication23 {
    public static void main(String[] args) throws IOException {
        new FileHandler("./test_%u_%g.log", 10000, 100, true);
    }
}
With Java 7, this test code creates only one file, "test_0_0.log", no matter how often I run the program. This is the expected behaviour because the append parameter in the constructor is set to true.
But if I run this sample with Java 8, every run creates a new file (test_0_0.log, test_0_1.log, test_0_2.log, ...). I think this is a bug.
Imho, the related change in Java is this one:
@@ -413,18 +428,18 @@
// object. Try again.
continue;
}
- FileChannel fc;
+
try {
- lockStream = new FileOutputStream(lockFileName);
- fc = lockStream.getChannel();
- } catch (IOException ix) {
- // We got an IOException while trying to open the file.
- // Try the next file.
+ lockFileChannel = FileChannel.open(Paths.get(lockFileName),
+ CREATE_NEW, WRITE);
+ } catch (FileAlreadyExistsException ix) {
+ // try the next lock file name in the sequence
continue;
}
+
boolean available;
try {
- available = fc.tryLock() != null;
+ available = lockFileChannel.tryLock() != null;
// We got the lock OK.
} catch (IOException ix) {
// We got an IOException while trying to get the lock.
@@ -440,7 +455,7 @@
}
// We failed to get the lock. Try next file.
- fc.close();
+ lockFileChannel.close();
}
}
(In full: OpenJDK changeset 6123:ac22a52a732c)
I know that normally the FileHandler gets closed by the LogManager, but this is not the case if the system or the application crashes or the process gets killed. This is why I do not have a "close" statement in the above sample code.
Now I have two questions:
1) What is your opinion? Is this a bug? (Almost answered in the following comments and answers)
2) Do you know a workaround to get the old Java 7 behavior in Java 8? (The more important question...)
Thanks for your answers.

Closing the FileHandler deletes the 'lck' file. If the lock file exists at all under a JDK 8 version earlier than update 40 (java.util.logging), the FileHandler is going to rotate. In the OpenJDK discussion, the decision was made to always rotate if the lck file exists, in addition to when the current process can't lock it. The reason given is that it is always safer to rotate when the lock file exists. So this gets really nasty if you have a rotating pattern in use with a mix of JDK versions, because the JDK 7 version will reuse the lock but the JDK 8 version will leave it and rotate. That is what you are doing with your test case.
Using JDK8 if I purge all log and lck files from the working directory and then run:
public static void main(String[] args) throws IOException {
    System.out.println(System.getProperty("java.runtime.version"));
    new FileHandler("./test_%u.log", 10000, 100, true).close();
}
I always see a file named 'test_0.log.0'. I get the same result using JDK7.
The bottom line is that you have to ensure your FileHandlers are closed. If a FileHandler is never garbage collected or removed from the logger tree, then LogManager will close it. Otherwise you have to close it yourself. After that is fixed, purge all lock files before running your new patched code. Then be aware that if the JVM process crashes or is killed, the lock file won't be deleted. If you have an I/O error on close, your lock file won't be deleted either. When the next process starts, the FileHandler will rotate.
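
A minimal sketch of closing a handler explicitly, assuming the handler is created and owned by your own code rather than configured through logging.properties. `ClosedHandlerDemo` and `logOnce` are illustrative names. This cannot help when the JVM is killed, since the `finally` block never runs in that case.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.logging.FileHandler;
import java.util.logging.Level;
import java.util.logging.LogRecord;

/** Hypothetical example: always close a FileHandler you created yourself. */
final class ClosedHandlerDemo {
    private ClosedHandlerDemo() {}

    /** Publishes one record and closes the handler so its lock file is removed. */
    static void logOnce(String pattern, String message) {
        FileHandler handler = null;
        try {
            handler = new FileHandler(pattern, 10000, 2, true);
            handler.publish(new LogRecord(Level.INFO, message));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (handler != null) {
                handler.close(); // releases the lock and deletes the .lck file
            }
        }
    }
}
```

For example, `ClosedHandlerDemo.logOnce("./test_%u.log", "hello")` should leave no `.lck` file behind in the working directory after a normal return.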
As you point out, it is possible to use up all of the lock files on JDK8 if the above conditions occur over 100 runs. A simple test for this is to run the following code twice without deleting the log and lck files:
public static void main(String[] args) throws Exception {
    System.out.println(System.getProperty("java.runtime.version"));
    ReferenceQueue<FileHandler> q = new ReferenceQueue<>();
    for (int i = 0; i < 100; i++) {
        WeakReference<FileHandler> h = new WeakReference<>(
                new FileHandler("./test_%u.log", 10000, 2, true), q);
        while (q.poll() != h) {
            System.runFinalization();
            System.gc();
            System.runFinalization();
            Thread.yield();
        }
    }
}
However, the test case above won't work if JDK-6774110 is fixed correctly. The issue for this can be tracked on the OpenJDK site under RFR: 8048020 - Regression on java.util.logging.FileHandler and FileHandler webrev.

How to check FileLock without truncating file?

I recently added filelocks to my downloader asynctask:
FileOutputStream file = new FileOutputStream(_outFile);
file.getChannel().lock();
and after download completes, file.close() to release lock.
From a called BroadcastReceiver (different thread), I need to go through the files and see which are downloaded and which are still locked. I started with trylock:
for (int i = 0; i < files.length; i++) {
    try {
        System.out.print((files[i]).getName());
        test = new FileOutputStream(files[i]);
        FileLock lock = test.getChannel().tryLock();
        if (lock != null) {
            lock.release();
            // Not a partial download. Do stuff.
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        test.close();
    }
}
Unfortunately I read the file is truncated (0 bytes) when the FileOutputStream is created.
I set it to append, but the lock doesn't seem to take effect, all appear to be un-locked (fully downloaded)
Is there another way to check if a write-lock is applied to the file currently, or am I using the wrong methods here? Also, is there a way to debug file-locks, from the ADB terminal or Eclipse?

None of this is going to work. Check the Javadoc. Locks are held on behalf of the entire process, i.e. the JVM, not by individual threads.
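
To illustrate both points, here is a small sketch. It opens the file through `RandomAccessFile` in "rw" mode, which does not truncate existing content (note that it does create the file if it is missing), and it shows that a second lock attempt from the same JVM does not return null but throws `OverlappingFileLockException`. `LockProbe` and the returned strings are made up for this example.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;

/** Hypothetical probe: tries to lock a file without truncating it. */
final class LockProbe {
    private LockProbe() {}

    static String probe(File file) {
        // "rw" opens for writing without truncating, unlike new FileOutputStream(file).
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            FileLock lock = raf.getChannel().tryLock();
            if (lock == null) {
                return "locked-by-another-process";
            }
            lock.release();
            return "free";
        } catch (OverlappingFileLockException e) {
            // Some channel in this same JVM already holds an overlapping lock.
            return "locked-by-this-jvm";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Holds a lock on a temp file and probes it from the same JVM. */
    static String demo() {
        try {
            File file = File.createTempFile("lock-probe", ".bin");
            try (RandomAccessFile holder = new RandomAccessFile(file, "rw")) {
                holder.getChannel().lock();
                return probe(file);
            } finally {
                file.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Since the downloader and the receiver in the question run in the same process, tracking in-progress downloads in a shared in-memory set (or using a temporary file name until the download completes) is likely a simpler signal than file locks.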

My first thought would be to open it for append per the javadocs
test = new FileOutputStream(files[i], true); // the true specifies for append
