Handle a great number of files - Java

I have an external disk with a billion files. If I mount the external disk on computer A, my program scans all the files' paths and saves them in a database table. After that, when I eject the external disk, that data remains in the table. The problem is: if some files are deleted on computer B and I then mount the disk on computer A again, I must synchronize the database table on computer A. However, I don't want to scan all the files again, because that takes a lot of time and wastes a lot of memory. Is there any way to update the database table without scanning all the files, whilst minimizing the memory used?
Besides, in my case, the memory limitation is more important than time, which means I would rather save memory than save time.
I think I can split the files into many sections and use some specific function (maybe SHA-1?) to check whether the files in a given section have been deleted. However, I cannot work out a way to split the files into sections. Can anyone help me, or give me better ideas?

If you don't have control over the file system on the disk you have no choice but to scan the file names on the entire disk. To list the files that have been deleted you could do something like this:
update files in database: set "seen on this scan" to false
for each file on disk do:
    insert/update database, setting "seen on this scan" to true
done
deleted files = select from files where "seen on this scan" = false
A solution to the DB performance problem could be accumulating the file names into a list of some kind and doing a bulk insert/update whenever you reach, say, 1,000 files.
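For illustration, a rough JDBC sketch of that mark-and-sweep approach with batching; the files(path, seen) table and the PostgreSQL-style upsert here are assumptions, not part of the question:

    // Sketch of the mark-and-sweep scan with batched upserts. The table
    // files(path, seen) and the SQL dialect are assumptions.
    import java.sql.Connection;
    import java.sql.PreparedStatement;

    public class ScanBatcher {
        private static final int BATCH_SIZE = 1000;

        // pathsFromDisk should be streamed as the disk is scanned,
        // not collected into memory first.
        static void syncScan(Connection conn, Iterable<String> pathsFromDisk) throws Exception {
            conn.setAutoCommit(false);
            // 1. Mark everything as unseen before the scan.
            conn.createStatement().executeUpdate("UPDATE files SET seen = FALSE");
            // 2. Upsert in batches of 1000 as files are encountered.
            PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO files (path, seen) VALUES (?, TRUE) " +
                    "ON CONFLICT (path) DO UPDATE SET seen = TRUE"); // PostgreSQL-style upsert
            int n = 0;
            for (String path : pathsFromDisk) {
                ps.setString(1, path);
                ps.addBatch();
                if (++n % BATCH_SIZE == 0) ps.executeBatch();
            }
            ps.executeBatch();
            // 3. Anything still unseen was deleted from the disk.
            conn.createStatement().executeUpdate("DELETE FROM files WHERE seen = FALSE");
            conn.commit();
        }
    }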
As for directories with 1 billion files, you just need to replace the code that lists the files with something that wraps the C functions opendir and readdir. If I were you I wouldn't worry about it too much for now. No sane person has 1 billion files in one directory, because that sort of thing cripples file systems and common OS tools, so the risk is low and the solution is easy.
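In Java 7+ there is a ready-made equivalent: java.nio.file.Files.newDirectoryStream streams entries lazily, much like an opendir/readdir loop, instead of materializing one giant array the way File.listFiles() does. A minimal sketch (the mount point is a made-up example):

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class LazyLister {
        public static void main(String[] args) throws IOException {
            Path dir = Paths.get("/mnt/external"); // hypothetical mount point
            // DirectoryStream reads entries incrementally, so memory use stays
            // flat even for directories with a huge number of files.
            try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
                for (Path entry : stream) {
                    System.out.println(entry);
                }
            }
        }
    }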

In theory, you could speed things up by checking "modified" timestamps on directories. If a directory has not been modified, then you don't need to check any of the files in that directory. Unfortunately, you do need to scan possible subdirectories, and finding them involves scanning the directory ... unless you've saved the directory tree structure.
And of course, this is moot if you've got a flat directory containing a billion files.
I imagine that you are assembling all of the filepaths in memory so that you can sort them before querying the database. (And sorting them is a GOOD idea ...) However, there is an alternative to sorting in memory:
Write the filepaths to a file.
Use an external sort utility to sort the file into primary key order.
Read the sorted file, and perform batch queries against the database in key order.
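A rough sketch of that pipeline; the temp-file paths and the use of the Unix sort(1) utility are assumptions:

    // Sketch: dump paths, sort externally, then batch-query in key order.
    // Assumes a Unix-like system with the standard sort(1) utility on PATH.
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class ExternalSortSync {
        public static void main(String[] args) throws IOException, InterruptedException {
            // Step 1 (not shown): scan the disk, writing one path per line
            // to /tmp/paths.txt instead of holding them all in memory.

            // Step 2: sort the file on disk; sort(1) spills to temp files,
            // so memory stays bounded regardless of input size.
            Process p = new ProcessBuilder("sort", "-o", "/tmp/paths.sorted.txt", "/tmp/paths.txt")
                    .inheritIO().start();
            if (p.waitFor() != 0) throw new IOException("sort failed");

            // Step 3: stream the sorted file and query the database in key order.
            BufferedReader in = new BufferedReader(new FileReader("/tmp/paths.sorted.txt"));
            try {
                String path;
                while ((path = in.readLine()) != null) {
                    // batch up paths here and run ordered queries against the DB
                }
            } finally {
                in.close();
            }
        }
    }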
(Do you REALLY have a billion files on a disc? That sounds like a bad design for your data store ...)

Do you have a list of what's deleted when the delete happens (or can you change whatever process does the deleting to create one)? If so, couldn't you keep an "I've been deleted" list with a timestamp, and then pick up items from that list to synchronize only what's changed? Naturally, you would still want some kind of batch job to sync during a slow time on the server, but I think that could reduce the load.
Another option may be, depending on what is making the changes, to have that process update the database(s) directly when it deletes (if you have multiple nodes). This would introduce some coupling into the systems, but would be the most efficient way to do it.
The best ways, in my opinion, are some variation on the idea of messaging that a delete has occurred (even if that's just a file that you write somewhere with a list of recently deleted files), or some kind of direct callback mechanism, either through code or by having the delete process adjust the persistent data store the application uses directly.
Even with all this said, you would always need to have some kind of index synchronization or periodic sanity check on the indexes to be sure that everything is matched up correctly.
You could (and I would be shocked if you didn't have to, given the number of files that you have) partition the file space into folders with, say, 5,000-10,000 files per folder, and then create a simple file that holds a hash of the names of all the files in the folder. This would catch deletes, but I still think a direct callback of some form when the delete occurs is a much better idea. If you have a monolithic folder with all this stuff, creating something to break it into separate folders (we used a simple number under the main folder so we could go on ad nauseam) should speed everything up greatly. Even if you have to do this only for new files and leave the old files in place as is, at least you could stop the bleeding on file retrieval.
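A minimal sketch of such a per-folder fingerprint, assuming SHA-1 over the sorted file names (both choices are illustrative; any stable digest over a stable ordering would do):

    import java.io.File;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.Arrays;

    public class FolderFingerprint {
        // Returns a digest of the sorted file names in a folder. If the stored
        // digest differs from a fresh one, something was added, renamed, or deleted.
        static byte[] fingerprint(File folder) throws Exception {
            String[] names = folder.list();
            if (names == null) throw new IllegalArgumentException("not a directory");
            Arrays.sort(names); // stable order so the hash is reproducible
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            for (String name : names) {
                md.update(name.getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0); // separator so "ab"+"c" != "a"+"bc"
            }
            return md.digest();
        }
    }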
In my opinion, since you are programmatically controlling an index of the files, you should really have the same program involved somehow (or notified) when changes occur to the underlying file system, as opposed to letting changes happen and then scanning through everything for updates. Naturally, to catch the outliers where this communication breaks down, you should also have synchronization code that actually checks what is in the file system and updates the index periodically (although this could, and probably should, be batched out of process from the main application).

If memory is important, I would go for the operating system facilities.
If you have ext4, I will presume you are on Unix (you can install find on other operating systems such as Windows). If that is the case, you could use the native find command (this example covers the last day; you can of course remember the last scan time and adjust the argument to whatever you like):
find /directory_path -type f -mtime -1 -print
Of course this won't catch the deletes. If a heuristic algorithm works for you, you can create a thread that slowly walks the files stored in your database (whatever you need to display first, then from newest to oldest) and checks whether each one still exists. This won't consume much memory. I reckon you won't be able to show a billion files to the user anyway.
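A sketch of that background checker; fetchNextBatch and deleteRow are hypothetical stand-ins for your database access code:

    // A low-priority thread that walks paths already stored in the database
    // and prunes rows whose files are gone, a batch at a time.
    import java.io.File;
    import java.util.List;

    public class SlowExistenceChecker implements Runnable {
        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                List<String> paths = fetchNextBatch(100); // next 100 paths from the table
                for (String path : paths) {
                    if (!new File(path).exists()) {
                        deleteRow(path); // file vanished; drop it from the index
                    }
                }
                try {
                    Thread.sleep(1000); // throttle so the disk and DB stay responsive
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
        private List<String> fetchNextBatch(int n) { /* query the table */ return java.util.Collections.emptyList(); }
        private void deleteRow(String path) { /* delete from the table */ }
    }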

Related

Is it a good idea to put big temp files into a Guava cache?

I have recently found out that Guava's caches have a weight system that you can use for limiting, instead of just counting the objects. I currently have a project where I use multiple >100MB files that are automatically downloaded when needed, since they are normally only used once in a period of ~5 days. However, it can sometimes happen that these files are needed 5 times in under 1 hour; since this is completely unpredictable, the files get downloaded 5 times. Now, my question is: would it be a good idea to make a cache of URL -> File where an object's weight value is equal to its size (I would, of course, register a handler that deletes the actual file when it is removed from the cache)? And would Guava's cache manage it properly and evict unused entries, or would it just throw out the big files when there is too much in the cache and always leave the smaller ones in there?
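A minimal sketch of the weighted-cache setup being described (the 10 GB budget and the String key type are assumptions):

    import java.io.File;
    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;
    import com.google.common.cache.RemovalListener;
    import com.google.common.cache.RemovalNotification;
    import com.google.common.cache.Weigher;

    public class FileCacheSketch {
        public static void main(String[] args) {
            Cache<String, File> cache = CacheBuilder.newBuilder()
                    .maximumWeight(10L * 1024 * 1024 * 1024) // ~10 GB budget (an assumption)
                    .weigher(new Weigher<String, File>() {
                        public int weigh(String url, File f) {
                            // Weigher returns int, so clamp very large files.
                            return (int) Math.min(f.length(), Integer.MAX_VALUE);
                        }
                    })
                    .removalListener(new RemovalListener<String, File>() {
                        public void onRemoval(RemovalNotification<String, File> n) {
                            n.getValue().delete(); // drop the on-disk file on eviction
                        }
                    })
                    .build();
        }
    }

As far as I know, Guava evicts by recency of use within its segments, not by entry weight, so large files are not preferentially thrown out; an idle small entry will go before a recently used big one.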

How to check tree of directories for changes efficiently after termination and restart of program?

I am writing a program that loads a library of data from the disk. It scans recursively through each folder specified by the user, reads the necessary metadata from each file, and then saves that in the program's library into a data structure that is suitable for display and manipulation by the user.
For a reasonably sized data set, this process takes 5-10 minutes. On the high end I could imagine it taking half an hour.
It also sets up a watcher for each directory in the tree, so if anything is changed after the initial scan while the program is running, that changed file or folder can be re-scanned and the library updated with the new data.
When the program terminates, the library data structure is serialized to disk, and then loaded back in at the beginning of the next session.
This leaves one gap that needed to be addressed -- if files are changed between sessions, there is no way to know about those changes.
The solution currently implemented is, when the program is launched and the persisted data is loaded, to then rescan the entire file structure and compare the scanned information to the loaded data, and if anything is different, to replace it.
Given that the rescan reads the metadata of each file and reloads everything, just to discard it after confirming nothing has changed, this seems like a very inefficient method to me.
Here is my question: I'd like to find some way to shortcut this re-scan process so I don't have to read all of the metadata back in and do a full rescan. Instead, it would be nice if there were a way to ask a folder "have your contents changed at all since the last time I saw you? If so, let me rescan you, otherwise, I won't bother rescanning."
One idea that occurs to me is to take a checksum of the folder's contents and store that in the database, and then compare the hashes during the re-scan.
Before I implement this solution, does anyone have a recommendation on how to accomplish this in a better way (or any advice for how to efficiently take the hash of a directory with java)?
Store timestamp on shutdown, then just do find -mnewer?
The most practical way is to traverse the file tree checking for files with a timestamp newer than when your application stopped. For example:
find root-dir -mnewer timestamp-file
though if you did it that way you may run into race conditions. (It would be better to do it in Java ... as you reinstantiate the watchers.)
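A sketch of the Java variant, walking the tree and comparing each file's modification time to a timestamp persisted at shutdown (lastShutdownMillis is assumed to come from the previous session):

    import java.io.IOException;
    import java.nio.file.*;
    import java.nio.file.attribute.BasicFileAttributes;

    public class ChangedSinceShutdown {
        public static void scan(Path root, final long lastShutdownMillis) throws IOException {
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                    if (attrs.lastModifiedTime().toMillis() > lastShutdownMillis) {
                        System.out.println("changed: " + file); // re-read metadata for this one
                    }
                    return FileVisitResult.CONTINUE;
                }
            });
        }
    }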
There are a couple of caveats:
Scanning a file tree takes time. The larger the tree, the longer it takes. If you are talking millions of files it could take hours, just to look at the timestamps.
Timestamps are not bombproof:
there can be issues if there are "discontinuities" in the system clock, or
there can be issues if some person or program with admin privilege tweaks file timestamps.
One idea that occurs to me is to take a checksum of the folder's contents and store that in the database, and then compare the hashes during the re-scan.
It would take much longer to compute checksums or hashes of files. The only way that would be feasible is if the operating system itself were to automatically compute and record a file checksum or hash each time a file was updated. (That would be a significant performance hit on all file / directory write operations ...)

java - File lastModified vs reading the file

I am using a file and need to update a value in Java when the file is modified.
So I am thinking of checking the modified time using lastModified of the File class and, if it has changed, reading the file and updating a single property from it.
My doubt is: is lastModified as heavy as reading a single property from the file / reading the whole file? My test results are showing almost the same numbers.
So is it better to read the file and update the property every time, or is checking lastModified the better option in the long run?
Note: this operation is performed every minute.
Or is there any better option than polling lastModified to check whether the file has changed? I am using Java 6.
Because you are using Java 6, checking the modified date or file contents is your only option (there's another answer that discusses using the newer java.nio.file functionality, and if you have the option of moving to Java 7, you should really, really consider that).
To answer your original question:
You didn't specify the location of the file (i.e. is it on a local disk or a server somewhere else) - I'll respond assuming local disk, but if the file is on a different machine, network latencies and netbios/dfs/whatever-network-file-system-you-use will exacerbate the differences.
Checking modified date on a file involves reading the meta data from disk. Checking the contents of the file require reading the file contents from disk (if the file is small, this will be one read operation. If the file is larger, it could be multiple read operations).
Reading the content of the file will probably involve read/write lock checking. Generally speaking, checking the modified date on the file will not require read/write lock checking (depending on the file system, there may still be consistency locks occurring on the meta data disk page, but those are generally lighter weight than file locks).
If the file changes frequently (i.e. you actually expect it to change every minute), then checking the modified date is just overhead - you are going to read the file contents in most cases anyway. If the file doesn't change frequently, then there would definitely be an advantage to modified date checking if the file was large (and you had to read the entire file to get at your information).
If the file is small, and doesn't change frequently, then it's pretty much a wash. In most cases, the file contents and the file meta data are already going to be paged into RAM - so both operations are a relatively efficient check of contents in RAM.
I personally would do the modified date check just because it logically makes sense (and it protects you from the performance hit if the file size ever grows above one disk page), but if the file changes frequently, then I'd just read the file contents. But really, either way is fine.
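A Java 6-compatible sketch of the modified-date check; the once-a-minute schedule and the property key are assumptions taken from the question:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.Properties;

    public class ModifiedDatePoller {
        private long lastSeen = -1L;
        private String value;

        // Called once a minute. Only re-reads the file when lastModified changes.
        void poll(File file) throws IOException {
            long stamp = file.lastModified(); // cheap metadata read
            if (stamp != lastSeen) {
                lastSeen = stamp;
                Properties props = new Properties();
                FileInputStream in = new FileInputStream(file);
                try {
                    props.load(in);
                } finally {
                    in.close();
                }
                value = props.getProperty("my.property"); // hypothetical key
            }
        }
    }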
And that brings us to the unsolicited advice: my guess is that the performance of this operation isn't a big deal in the greater scheme of things. Even if it took 1000X longer than it does now, it probably still wouldn't impact your application's primary purpose/performance. So my real advice here is just write the code and move on - don't worry about its performance unless it becomes a bottleneck for your application.
Quoting from The Java Tutorials:
To implement this functionality, called file change notification, a program must be able to detect what is happening to the relevant directory on the file system. One way to do so is to poll the file system looking for changes, but this approach is inefficient. It does not scale to applications that have hundreds of open files or directories to monitor.
The java.nio.file package provides a file change notification API, called the Watch Service API. This API enables you to register a directory (or directories) with the watch service. When registering, you tell the service which types of events you are interested in: file creation, file deletion, or file modification. When the service detects an event of interest, it is forwarded to the registered process. The registered process has a thread (or a pool of threads) dedicated to watching for any events it has registered for. When an event comes in, it is handled as needed.
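A minimal sketch of that Watch Service API (Java 7+; the watched directory is a made-up example):

    import java.io.IOException;
    import java.nio.file.*;

    public class DirWatcher {
        public static void main(String[] args) throws IOException, InterruptedException {
            Path dir = Paths.get("/some/dir"); // hypothetical directory to watch
            WatchService watcher = FileSystems.getDefault().newWatchService();
            dir.register(watcher,
                    StandardWatchEventKinds.ENTRY_CREATE,
                    StandardWatchEventKinds.ENTRY_DELETE,
                    StandardWatchEventKinds.ENTRY_MODIFY);
            while (true) {
                WatchKey key = watcher.take(); // blocks until events arrive
                for (WatchEvent<?> event : key.pollEvents()) {
                    System.out.println(event.kind() + ": " + event.context());
                }
                if (!key.reset()) break; // directory no longer accessible
            }
        }
    }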
Here are some links which provide sample source for an implementation of this service:
Link 1
Link 2
Edit: Thanks to Kevin Day for pointing out in the comments that, since you are using Java 6, this might not work for you. There is an alternative available in Apache Commons IO, but I have not worked with it, so you will have to check it yourself :)
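For reference, a sketch using Commons IO's monitor package (org.apache.commons.io.monitor), which runs on Java 6 and polls internally but hides that behind listener callbacks; the directory and interval here are assumptions:

    import java.io.File;
    import org.apache.commons.io.monitor.FileAlterationListenerAdaptor;
    import org.apache.commons.io.monitor.FileAlterationMonitor;
    import org.apache.commons.io.monitor.FileAlterationObserver;

    public class CommonsIoWatch {
        public static void main(String[] args) throws Exception {
            File dir = new File("/some/dir"); // hypothetical directory
            FileAlterationObserver observer = new FileAlterationObserver(dir);
            observer.addListener(new FileAlterationListenerAdaptor() {
                @Override
                public void onFileChange(File file) {
                    System.out.println("changed: " + file); // re-read your property here
                }
            });
            FileAlterationMonitor monitor = new FileAlterationMonitor(60 * 1000); // poll every minute
            monitor.addObserver(observer);
            monitor.start();
        }
    }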

How to consistently access a file?

I'm looking for a way to select a file, and a time frame (or until a certain action is performed), and "use" or "read" the file for that amount of time - just a way to keep other programs from accessing it. The quality assurance department at my company is in need of an application like this, and I believe it's possible to make it, but I'm not sure how to approach it. Possibly "read" the file over and over until the time is reached or an action is performed?
Any ideas?
For Java, the answer would be to use a FileLock, which maps to the native mechanism of the operating system.
On Linux you can block your file access using system calls like flock.
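A minimal FileLock sketch (the path and hold time are placeholders); note that on most platforms this is an advisory lock, so it only keeps out processes that also use locks:

    import java.io.RandomAccessFile;
    import java.nio.channels.FileChannel;
    import java.nio.channels.FileLock;

    public class HoldFile {
        public static void main(String[] args) throws Exception {
            RandomAccessFile raf = new RandomAccessFile("/path/to/file", "rw"); // hypothetical path
            FileChannel channel = raf.getChannel();
            FileLock lock = channel.lock(); // exclusive lock on the whole file
            try {
                Thread.sleep(60 * 1000); // hold the lock for the desired time frame
            } finally {
                lock.release();
                raf.close();
            }
        }
    }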
A rather low-tech alternative can be:
1. Read the file.
2. Keep it in memory.
3. Delete the file.
4. Work with your memory copy.
5. Dump your memory copy into a new file with the same name when you have finished.
This second method is limited by the file size and your system memory, and besides, you can lose your file if the system stops working before reaching step 5. It is just a silly alternative to system calls; I would prefer to use system API services such as flock.

Technology to transfer data with external system

We have an interface with an external system in which we get flat files from them and process those files. At present we run a job a few times a day that checks whether the file is at the FTP location and processes it if it exists.
I recently read that it is a bad idea to use a file system as a message broker, which is why I am asking this question. Can someone clarify whether a situation like this one is a right fit for some other tool, and if so, which one?
Ours is a java based application.
The first question you should ask is "is it working?".
If the answer to that is yes, then you should be circumspect about change just because you read it was a bad idea. I've read that chocolate may be bad for you but I'm not giving it up :-)
There are potential problems that you can run into, such as files being deleted without your knowledge, or trying to process files that are only half-transferred (though there are ways to mitigate both of those, such as permissions in the former case, or the use of sentinel files or content checking in the latter case).
Myself, I would prefer a message queueing system such as IBM's MQ or JMS (since that's what they're built for, and they do make life a little easier) but, as per the second paragraph above, only if either:
problems appear or become evident with the current solution; or
you have some spare time and money lying around for unnecessary rework.
The last bullet needs expansion. While the work may be unnecessary (in terms of fixing a non-existent problem), that doesn't necessarily make it useless, especially if it can improve performance or security, or reduce the maintenance effort.
I would use a database to synchronize your files. Have a database that points to the file locations. Put an entry into the database only when a file has been fully transferred. This ensures that you are picking up completed files. You can poll the database to check whether new entries are present instead of polling the file system - a very simple setup for a polling mechanism. If you would like to be told when a new file appears in the folder, then you would need to go in for a message queue.
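A sketch of that handshake from the consumer's side; the transfer_log(path, processed) table and the JDBC URL are assumptions:

    // The sender inserts a row only after the file transfer completes;
    // the consumer polls the table instead of the file system.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class TransferPoller {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection("jdbc:h2:mem:demo"); // hypothetical DB
            Statement st = conn.createStatement();
            ResultSet rs = st.executeQuery(
                    "SELECT path FROM transfer_log WHERE processed = FALSE");
            while (rs.next()) {
                String path = rs.getString("path");
                // process the completed file at 'path', then mark the row processed
            }
            conn.close();
        }
    }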
