Maintain list of processed files to prevent duplicate file processing - java

I am looking for guidance on the design approach for solving one of the problems we have in our application.
Our Java application has scheduled jobs, for which we use the Quartz scheduler. The application can have thousands of jobs that do the following:
Scan a folder location for any new files.
If there is a new file, then kick off the associated workflow to process it.
The requirement is to:
Process only new files.
If any duplicate file arrives (file with the same name), then don't process it.
As of now, we persist the list of processed files in the Quartz job metadata. But this solution is not scalable: over the years (and depending on the number of files received per day, which can range up to 100K), the job metadata that holds the processed-file list grows very large, and it has started causing data-truncation errors (while persisting job metadata in the Quartz tables) as well as slowness.
What is the best approach for implementing this requirement and ensuring that we don't process duplicate files that arrive with the same name? Should we consider persisting the processed-file list in an external database instead of the job metadata? If we use a single external database table for all those thousands of jobs, that table may also grow huge over the years, which doesn't look like the best approach either (though proper indexing may help).
Any guidance here would be appreciated. This looks like a common use case for applications that continuously process new files, so I'm looking for the best possible approach to address this concern.

If not processing duplicate files is critical for you, the best way to do it would be to store the file names in a database. Keep in mind that this could be slow, since you would be querying for each file name, or running one large query for all the new file names.
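A minimal sketch of the idea, using an in-memory Set as a stand-in for a database table with a unique index on the file name (the class and method names here are illustrative, not from the question):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for a DB table with a UNIQUE constraint on the file name.
// In a real setup, an INSERT that hits the unique constraint plays the
// same role as Set.add returning false.
public class ProcessedFiles {
    private final Set<String> processed = ConcurrentHashMap.newKeySet();

    /** Returns true if the file is new and was claimed for processing. */
    public boolean tryClaim(String fileName) {
        return processed.add(fileName); // false => duplicate, skip it
    }
}
```

With a real database, the equivalent would be an insert that ignores duplicates (or catching the duplicate-key exception), so that the check and the claim are a single atomic step even when several job nodes race on the same file name.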
That said, if you're willing to process new files which may be a duplicate, there are a number of things that can be done as an alternative:
Move processed files to another folder, so that your folder will always have unprocessed files
Add a custom attribute to your processed files, and process files that do not have that attribute. Be aware that this method is not supported by all file systems. See this answer for more information.
Keep a reference to the time when your last quartz job started, and process new files which were created after that time.
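The first alternative (moving processed files out of the way) can be sketched like this; the directory names are illustrative:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

public class MoveAfterProcessing {
    /** Processes every file in inbox, then moves it to done, so the
     *  next scan only ever sees unprocessed files. */
    public static void drain(Path inbox, Path done) throws IOException {
        Files.createDirectories(done);
        try (Stream<Path> files = Files.list(inbox)) {
            for (Path file : (Iterable<Path>) files::iterator) {
                if (!Files.isRegularFile(file)) continue;
                // ... run the associated workflow for 'file' here ...
                Files.move(file, done.resolve(file.getFileName()),
                           StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }
}
```

Note that this only deduplicates against files still present in the done folder if you also check it before processing; on its own it guarantees the input folder contains only unprocessed files.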

Related

Apache Beam / Google Dataflow Final step to run only once

I have a pipeline where I download thousands of files, then transform them and store them as CSV on Google Cloud Storage, before running a load job on BigQuery.
This works fine, but as I run thousands of load jobs (one per downloaded file), I reached the quota for imports.
I've changed my code so it lists all the files in a bucket and runs one job with all the files as parameters of the job.
So basically I need the final step to be run only once, when all the data has been processed. I guess I could use a groupBy transform to make sure all the data has been processed, but I'm wondering whether there is a better / more standard approach to it.
If I understood your question correctly, we may have had a similar problem in one of our dataflows - we were hitting the 'Load jobs per table per day' BigQuery limit because the dataflow execution was triggered for each file in GCS separately and we had 1000+ files in the bucket.
In the end, the solution to our problem was quite simple - we modified our TextIO.read transform to use wildcards instead of individual file names,
i.e. TextIO.read().from("gs://<BUCKET_NAME>/<FOLDER_NAME>/**")
In this way only one dataflow job was executed and as a consequence all the data written to BigQuery was considered as a single load job, despite the fact that there were multiple sources.
Not sure if you can apply the same approach, though.

How to check tree of directories for changes efficiently after termination and restart of program?

I am writing a program that loads a library of data from the disk. It scans recursively through each folder specified by the user, reads the necessary metadata from each file, and then saves that in the program's library into a data structure that is suitable for display and manipulation by the user.
For a reasonable sized data set, this process takes 5-10 minutes. On the high end I could imagine it taking half an hour.
It also sets up a watcher for each directory in the tree, so if anything is changed after the initial scan while the program is running, that changed file or folder can be re-scanned and the library updated with the new data.
When the program terminates, the library data structure is serialized to disk, and then loaded back in at the beginning of the next session.
This leaves one gap that needed to be addressed -- if files are changed between sessions, there is no way to know about those changes.
The solution currently implemented is, when the program is launched and the persisted data is loaded, to then rescan the entire file structure and compare the scanned information to the loaded data, and if anything is different, to replace it.
Given that the rescan reads the metadata of each file and reloads everything, just to discard it after confirming nothing has changed, this seems like a very inefficient method to me.
Here is my question: I'd like to find some way to shortcut this re-scan process so I don't have to read all of the metadata back in and do a full rescan. Instead, it would be nice if there were a way to ask a folder "have your contents changed at all since the last time I saw you? If so, let me rescan you, otherwise, I won't bother rescanning."
One idea that occurs to me is to take a checksum of the folder's contents and store that in the database, and then compare the hashes during the re-scan.
Before I implement this solution, does anyone have a recommendation on how to accomplish this in a better way (or any advice for how to efficiently take the hash of a directory with java)?
Store a timestamp on shutdown, then just do find -mnewer?
The most practical way is to traverse the file tree checking for files with a timestamp newer than when your application stopped. For example:
find root-dir -mnewer last-shutdown-marker
where last-shutdown-marker is a file whose timestamp was set when the application shut down. If you do it that way, though, you may run into race conditions. (It would be better to do it in Java ... as you reinstantiate the watchers.)
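A rough Java equivalent of the find approach, comparing each file's modification time against a persisted cutoff (how the cutoff instant is stored and loaded is up to your application; this is a sketch, not the asker's implementation):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ChangedFileScanner {
    /** Returns all regular files under root modified after the cutoff. */
    public static List<Path> modifiedSince(Path root, Instant cutoff) throws IOException {
        FileTime cut = FileTime.from(cutoff);
        try (Stream<Path> walk = Files.walk(root)) {
            return walk.filter(Files::isRegularFile)
                       .filter(p -> {
                           try {
                               return Files.getLastModifiedTime(p).compareTo(cut) > 0;
                           } catch (IOException e) {
                               return true; // can't read attributes: rescan to be safe
                           }
                       })
                       .collect(Collectors.toList());
        }
    }
}
```

This inherits the same caveats as find: it still walks the whole tree, and it trusts the timestamps.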
There are a couple of caveats:
Scanning a file tree takes time. The larger the tree, the longer it takes. If you are talking millions of files it could take hours, just to look at the timestamps.
Timestamps are not bombproof:
there can be issues if there are "discontinuities" in the system clock, or
there can be issues if some person or program with admin privilege tweaks file timestamps.
One idea that occurs to me is to take a checksum of the folder's contents and store that in the database, and then compare the hashes during the re-scan.
It would take much longer to compute checksums or hashes of the files. The only way that would be feasible is if the operating system itself automatically computed and recorded a checksum or hash each time a file was updated. (That would be a significant performance hit on all file / directory write operations ...)

Hadoop process WARC files

I have a general question about Hadoop file splitting and multiple mappers. I am new to Hadoop and am trying to get a handle on how to setup for optimal performance. My project is currently processing WARC files which are GZIPed.
Using the current InputFileFormat, each file is sent to one mapper and is not split. I understand this is the correct behavior for a gzip-compressed file. Would there be a performance benefit to decompressing the file as an intermediate step before running the job, to allow the input to be split and thus use more mappers?
Would that be possible? Does having more mappers create more overhead in latency or is it better to have one mapper? Thanks for your help.
Although WARC files are gzipped, they are splittable (cf. Best splittable compression for Hadoop input = bz2?) because every record has its own deflate block. But the record offsets must be known in advance.
Is this really necessary, though? The Common Crawl WARC files are each about 1 GB in size, and one should normally be processed within at most 15 minutes. Given the overhead of launching a map task, that's a reasonable running time for a mapper. Alternatively, a mapper could process a few WARC files, but it's important that the input WARC file list is split into enough pieces that all nodes are running tasks. Running a job over a single WARC file on Hadoop would mean a lot of unnecessary overhead.

How to poll a directory and not hit a file-transfer race condition?

I am working on an application that polls a directory for new input files at a defined interval. The general process is:
Input files FTP'd to landing strip directory by another app
Our app wakes up
List files in the input directory
Atomic-move the files to a separate staging directory
Kick off worker threads (via a work-distributing queue) to consume the files from the staging directory
Go back to sleep
I've uncovered a problem where the app will pick up an input file while it is incomplete and still in the middle of being transferred, resulting in a worker-thread error that requires manual intervention. This is a scenario we need to avoid.
I should note that the file transfer does complete successfully and the server gets a complete copy, but this happens after the app has already given up due to an error.
I'd like to solve this in a clean way, and while I have some ideas for solutions, they all have problems I don't like.
Here's what I've considered:
Force the other apps (some of which are external to our company) to initially transfer the input files to a holding directory, then atomic-move them into the input directory once they're transferred. This is the most robust idea I've had, but I don't like this because I don't trust that it will always be implemented correctly.
Retry a finite number of times on error. I don't like this because it's a partial solution, it makes assumptions about transfer time and file size that could be violated. It would also blur the lines between a genuinely bad file and one that's just been incompletely transferred.
Watch the file sizes and only pick up the file if its size hasn't changed for a defined period of time. I don't like this because it's too complex in our environment: the poller is a non-concurrent clustered Quartz job, so I can't just persist this info in memory because the job can bounce between servers. I could store it in the jobdetail, but this solution just feels too complicated.
I can't be the first to have encountered this problem, so I'm sure I'll get better ideas here.
I had that situation once; we got the other side to upload the files with a different extension, e.g. *.tmp, and then, after the file copy completed, rename the file to the extension my code was polling for. Not sure whether that is as easily done when the files come in by FTP, though.
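A sketch of that convention on the polling side, combined with the atomic move from the question (the .tmp suffix and directory names are illustrative):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

public class LandingStripPoller {
    /** Moves complete files (anything not still named *.tmp) from the
     *  landing directory into staging, atomically where supported. */
    public static List<Path> sweep(Path landing, Path staging) throws IOException {
        Files.createDirectories(staging);
        List<Path> staged = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(landing)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) continue;
                if (file.getFileName().toString().endsWith(".tmp")) continue; // still uploading
                Path target = staging.resolve(file.getFileName());
                Files.move(file, target, StandardCopyOption.ATOMIC_MOVE);
                staged.add(target);
            }
        }
        return staged;
    }
}
```

ATOMIC_MOVE requires landing and staging to be on the same filesystem; otherwise Files.move throws AtomicMoveNotSupportedException, which is worth surfacing loudly rather than silently falling back to a copy.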

Handle a great number of files

I have an external disk with a billion files. If I mount the external disk on computer A, my program scans all the file paths and saves them in a database table. After that, when I eject the external disk, that data still remains in the table. The problem is: if some files are deleted on computer B and I then mount the disk on computer A again, I must synchronize the database table on computer A. However, I don't want to scan all the files again, because it takes a lot of time and wastes a lot of memory. Is there any way to update the database table without scanning all the files, while minimizing the memory used?
Besides, in my case the memory limitation is more important than time, which means I would rather save memory than save time.
I think I could split the files into many sections and use some specific function (maybe SHA1?) to check whether the files in a section have been deleted. However, I cannot figure out a way to split the files into sections. Can anyone help me, or give me better ideas?
If you don't have control over the file system on the disk you have no choice but to scan the file names on the entire disk. To list the files that have been deleted you could do something like this:
update files in database: set "seen on this scan" to false
for each file on disk do:
insert/update database, setting "seen on this scan" to true
done
deleted files = select from files where "seen on this scan" = false
A solution to the DB performance problem could be to accumulate the file names into a list of some kind and do a bulk insert/update whenever you reach, say, 1000 files.
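The mark-and-sweep above can be sketched like this, with a Map standing in for the database table (in a real setup these would be UPDATE/INSERT statements, batched as suggested):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;

public class DeletedFileDetector {
    /** table maps file path -> "seen on this scan" flag. Returns the
     *  paths that were in the table but were not seen on disk. */
    public static List<String> sweep(Map<String, Boolean> table,
                                     Collection<String> filesOnDisk) {
        table.replaceAll((path, seen) -> false);        // reset all flags
        for (String path : filesOnDisk) {
            table.put(path, true);                      // insert/update as seen
        }
        List<String> deleted = new ArrayList<>();
        for (Map.Entry<String, Boolean> e : table.entrySet()) {
            if (!e.getValue()) deleted.add(e.getKey()); // never seen this scan
        }
        return deleted;
    }
}
```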
As for directories with 1 billion files, you just need to replace the code that lists the files with something that wraps the C functions opendir and readdir. If I were you, I wouldn't worry about it too much for now. No sane person has 1 billion files in one directory, because that sort of thing cripples file systems and common OS tools, so the risk is low and the solution is easy.
In theory, you could speed things up by checking "modified" timestamps on directories. If a directory has not been modified, then you don't need to check any of the files in that directory. Unfortunately, you do need to scan possible subdirectories, and finding them involves scanning the directory ... unless you've saved the directory tree structure.
And of course, this is moot if you've got a flat directory containing a billion files.
I imagine that you are assembling all of the filepaths in memory so that you can sort them before querying the database. (And sorting them is a GOOD idea ...) However there is an alternative to sorting in memory:
Write the filepaths to a file.
Use an external sort utility to sort the file into primary key order.
Read the sorted file, and perform batch queries against the database in key order.
(Do you REALLY have a billion files on a disc? That sounds like a bad design for your data store ...)
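A sketch of the first two steps, delegating the sort to the system sort utility via ProcessBuilder (this assumes a Unix-like environment with sort on the PATH; it is an illustration, not the answerer's code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ExternalSort {
    /** Writes the paths to a temp file, sorts it with the external
     *  'sort' utility, and returns the sorted output file. */
    public static Path sortPaths(List<String> paths)
            throws IOException, InterruptedException {
        Path in = Files.createTempFile("paths", ".txt");
        Path out = Files.createTempFile("paths-sorted", ".txt");
        Files.write(in, paths);
        Process p = new ProcessBuilder("sort", "-o", out.toString(), in.toString())
                .inheritIO()
                .start();
        if (p.waitFor() != 0) throw new IOException("sort failed");
        return out;
    }
}
```

The sorted file can then be streamed line by line while issuing batched, key-ordered queries, so neither the path list nor the result set ever has to fit in memory.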
Do you have a list of what's deleted when the delete happens (or can you change whatever process does the deleting so it creates one)? If so, couldn't you keep an "I've been deleted" list with timestamps, and then pick up items from that list so you only synchronize what's changed? Naturally, you would still want some kind of batch job to sync during a slow time on the server, but I think that could reduce the load.
Another option, depending on what is changing the files, is to have that process update the database(s) directly (if you have multiple nodes) when it deletes. This would introduce some coupling into the systems, but would be the most efficient way to do it.
The best ways, in my opinion, are some variation on the idea of messaging that a delete has occurred (even if that's just a file you write somewhere with a list of recently deleted files), or some kind of direct callback mechanism, either through code or by having the delete process adjust the application's persistent data store directly.
Even with all this said, you would always need to have some kind of index synchronization or periodic sanity check on the indexes to be sure that everything is matched up correctly.
You could (and I would be shocked if you didn't have to, given the number of files you have) partition the file space into folders with, say, 5,000-10,000 files per folder, and then create a simple file that holds a hash of the names of all the files in each folder. This would catch deletes, but I still think a direct callback of some form when the delete occurs is a much better idea. If you have one monolithic folder with all this stuff, creating something to break it into separate folders (we used simple numbers under the main folder so we could go on ad nauseam) should speed everything up greatly; even if you have to do this only for new files and leave the old files in place as-is, at least you could stop the bleeding on file retrieval.
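Computing such a per-folder fingerprint is straightforward: hash the sorted file names (and, if you also want to catch modifications, their sizes or timestamps). A sketch using SHA-1 from the JDK:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.stream.Stream;

public class FolderFingerprint {
    /** SHA-1 over the sorted file names in a folder; the value changes
     *  whenever a file is added, removed, or renamed. */
    public static String of(Path folder) throws IOException, NoSuchAlgorithmException {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        try (Stream<Path> files = Files.list(folder)) {
            files.map(p -> p.getFileName().toString())
                 .sorted() // fixed order => deterministic hash
                 .forEach(name -> sha1.update((name + "\n").getBytes(StandardCharsets.UTF_8)));
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : sha1.digest()) hex.append(String.format("%02x", b));
        return hex.toString();
    }
}
```

Comparing the stored fingerprint against a freshly computed one tells you which folders need a real rescan, without touching the files themselves.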
In my opinion, since you are programmatically controlling an index of the files, you should really have the same program involved somehow (or notified) when changes occur at the time of change to the underlying file system, as opposed to allowing changes to happen and then looking through everything for updates. Naturally, to catch the outliers where this communication breaks down, you should also have synchronization code in there to actually check what is in the file system and update the index periodically (although this could and probably should be batched out of process to the main application).
If memory is important, I would go for the operating system's facilities.
If you have ext4, I will presume you are on Unix (you can install find on other operating systems, such as Windows). If that is the case, you could use the native find command (this example is for the last day; you can of course remember the last scan time and adjust the expression to whatever you like):
find /directory_path -type f -mtime -1 -print
Of course, this won't catch the deletes. If a heuristic algorithm works for you, you can create a thread that slowly walks each file stored in your database (whatever you need to display first, then from newer to older) and checks that it still exists. This won't consume much memory. I reckon you won't be able to show a billion files to the user anyway.
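That slow background check can be sketched as a bounded pass over the stored paths, one batch at a time, so memory use stays small (the batch size is illustrative):

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class StalePathChecker {
    /** Checks up to batchSize stored paths and returns those that no
     *  longer exist on disk; call repeatedly from a low-priority thread. */
    public static List<String> nextStaleBatch(Iterator<String> storedPaths, int batchSize) {
        List<String> stale = new ArrayList<>();
        for (int i = 0; i < batchSize && storedPaths.hasNext(); i++) {
            String path = storedPaths.next();
            if (Files.notExists(Paths.get(path))) {
                stale.add(path); // in the DB, but gone from disk
            }
        }
        return stale;
    }
}
```

The iterator would be backed by a streaming database cursor rather than an in-memory list, so only one batch of paths is resident at a time.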