I am writing a little program that creates an index of all files in my directories. It basically iterates over each file on the disk and stores it in a searchable database, much like Unix's locate. The problem is that index generation is quite slow, since I have about a million files.
Once I have generated an index, is there a quick way to find out which files have been added or removed on the disk since the last run?
EDIT: I do not want to monitor file system events. I think the risk of getting out of sync is too high; I would much prefer something like a quick re-scan that finds where files have been added / removed. Maybe using the directory last-modified date or something?
A Little Benchmark
I just made a little benchmark. Running
dir /b /s M:\tests\ >c:\out.txt
takes 0.9 seconds and gives me all the information I need. When I use a Java implementation (much like this), it takes about 4.5 seconds. Any ideas on how to improve at least this brute-force approach?
Related posts: How to see if a subfile of a directory has changed
Can you jump out of Java?
You could simply use
dir /b /s /on M:\tests\
The /on switch sorts by name.
If you pipe that out to out.txt, you can then diff it against the output from the last time you ran the command, either in Java or in a batch file. Something like this in DOS; you'd need to get a diff tool, either diff in Cygwin or the excellent http://gnuwin32.sourceforge.net/packages/diffutils.htm
dir /b /s /on m:\tests >new.txt
diff new.txt archive.txt >diffoutput.txt
del archive.txt
ren new.txt archive.txt
Obviously you could use a Java diff class as well, but I think the thing to accept is that a shell command is nearly always going to beat Java at a file-listing operation.
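If you'd rather do the comparison in Java instead of with a diff tool, a set-based comparison of the two listings is simple enough. This is just a hedged sketch: archive.txt and new.txt are the file names used above, and the whole old listing is held in memory, which should be fine for around a million paths.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class ListingDiff {
    // Prints paths that were added or removed between two "dir /b /s" listings.
    public static void main(String[] args) throws IOException {
        Set<String> oldPaths = new HashSet<String>();
        BufferedReader oldList = new BufferedReader(new FileReader("archive.txt"));
        String line;
        while ((line = oldList.readLine()) != null) {
            oldPaths.add(line);
        }
        oldList.close();

        BufferedReader newList = new BufferedReader(new FileReader("new.txt"));
        while ((line = newList.readLine()) != null) {
            if (!oldPaths.remove(line)) {           // not in the old listing: new file
                System.out.println("added:   " + line);
            }
        }
        newList.close();

        for (String removed : oldPaths) {           // whatever is left has disappeared
            System.out.println("removed: " + removed);
        }
    }
}

Since this variant compares sets rather than line positions, it doesn't care about the listing order, so the /on switch isn't strictly needed for it.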
Unfortunately there's no standard way to listen to file system events in Java. This may be coming in Java 7.
For now, you'll have to google "java filesystem events" and pick the custom implementation that matches your platform.
I've done this in my tool MetaMake. Here is the recipe:
If the index is empty, add the root directory to the index with a timestamp == dir.lastModified()-1.
Find all directories in the index
Compare the timestamp of the directory in the index with the one from the filesystem. This is a fast operation since you have the full path (no scanning of all files/dirs in the tree involved).
If the timestamp has changed, you have a change in this directory. Rescan it and update the index.
If you encounter missing directories in this step, delete the subtree from the index
If you encounter an existing directory, ignore it (will be checked in step 2)
If you encounter a new directory, add it with timestamp == dir.lastModified()-1. Make sure it gets considered in step 2.
This will allow you to notice new and deleted files in an effective manner. Since you scan only for known paths in step #2, this will be very effective. File systems are bad at enumerating all the entries in a directory but they are fast when you know the exact name.
Drawback: You will not notice changed files. So if you edit a file, this will not reflect in a change of the directory. If you need this information, too, you will have to repeat the algorithm above for the file nodes in your index. This time, you can ignore new/deleted files because they have already been updated during the run over the directories.
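A minimal sketch of the directory pass described above, assuming the index is nothing more than a map from directory path to the timestamp recorded on the previous run (the real index obviously stores more):

import java.io.File;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class DirIndex {
    // directory path -> lastModified value recorded on the previous run
    private final Map<String, Long> dirTimestamps = new HashMap<String, Long>();

    /** Re-checks every known directory and rescans only those whose timestamp changed. */
    public void refresh() {
        // iterate over a copy, because rescan() adds newly discovered directories
        for (Map.Entry<String, Long> entry : new HashMap<String, Long>(dirTimestamps).entrySet()) {
            File dir = new File(entry.getKey());
            if (!dir.exists()) {
                removeSubtree(entry.getKey());           // directory is gone: drop its subtree
            } else if (dir.lastModified() != entry.getValue().longValue()) {
                rescan(dir);                             // something was added or removed here
            }
        }
    }

    private void rescan(File dir) {
        dirTimestamps.put(dir.getPath(), dir.lastModified());
        File[] children = dir.listFiles();
        if (children == null) return;
        for (File child : children) {
            if (child.isDirectory() && !dirTimestamps.containsKey(child.getPath())) {
                rescan(child);                           // new directory: index it right away
            }
            // file entries would be added to / removed from the file index here
        }
    }

    private void removeSubtree(String prefix) {
        for (Iterator<String> it = dirTimestamps.keySet().iterator(); it.hasNext(); ) {
            if (it.next().startsWith(prefix)) it.remove();
        }
    }
}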
[EDIT] Zach mentioned that timestamps are not enough. My reply is: There simply is no other way to do this. The notion of "size" is completely undefined for directories and changes from implementation to implementation. There is no API where you can register "I want to be notified of any change being made to something in the file system". There are APIs which work while your application is alive but if it stops or misses an event, then you're out of sync.
If the file system is remote, things get worse because all kinds of network problems can cause you to get out of sync. So while my solution might not be 100% perfect and watertight, it will work for all but the most contrived exceptional cases. And it's the only solution which even gets this far.
Now there is one kind of application which would want to preserve the timestamp of a directory after making a modification: a virus or worm. This will clearly break my algorithm, but then it's not meant to protect against a virus infection. If you want to protect against this, you must use a completely different approach.
The only other way to achieve what Zach wants is to build a new filesystem which logs this information permanently somewhere, sell it to Microsoft and wait a few years (probably 10 or more) until everyone uses it.
One way you could speed things up is to just iterate over the directories and check the last-modified time to see if the contents of the directory have changed since your last index; if they have, do a normal scan of that directory and see where things changed. I don't know how portable this will be, though, but a change in the hierarchy propagates up on a Linux system (it might be filesystem dependent), so you can start at the root and work your way down, stopping when you hit a directory that hasn't changed.
Given that we do not want to monitor file system events, could we then just keep track of the (name,size,time,checksum) of each file? The computation of the file checksum (or cryptographic hash, if you prefer) is going to be the bottleneck. You could just compute it once in the initial run, and re-compute it only when necessary subsequently (e.g. when files match on the other three attributes). Of course, we don't need to bother with this if we only want to track filenames and not file content.
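For the checksum itself, here is a minimal sketch assuming a SHA-256 digest (anything supported by MessageDigest, e.g. MD5, works the same way):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class Checksums {
    /** Returns the hex-encoded SHA-256 digest of a file's contents. */
    public static String sha256(String path) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        InputStream in = new FileInputStream(path);
        try {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        } finally {
            in.close();
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));     // two hex digits per byte
        }
        return hex.toString();
    }
}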
You mention that your Java implementation (similar to this) is very slow compared to "dir /s". I think there are two reasons for this:
File.listFiles() is inherently slow. See this earlier question "Is there a workaround for Java’s poor performance on walking huge directories?", and this Java RFE "File.list(FilenameFilter) is not effective for huge directories" for more information. This shortcoming is apparently addressed by NIO.2, coming soon.
Are you traversing your directories using recursion? If so, try a non-recursive approach, like pushing/popping directories to be visited on/off a stack. My limited personal experience suggests that the improvement can be quite significant.
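For what it's worth, a non-recursive walk with an explicit stack could look like this minimal sketch:

import java.io.File;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class IterativeWalker {
    /** Lists all files under root without recursion, using an explicit stack of directories. */
    public static List<File> listAllFiles(File root) {
        List<File> result = new ArrayList<File>();
        Deque<File> toVisit = new ArrayDeque<File>();
        toVisit.push(root);
        while (!toVisit.isEmpty()) {
            File dir = toVisit.pop();
            File[] entries = dir.listFiles();
            if (entries == null) continue;            // not a directory, or an I/O error
            for (File entry : entries) {
                if (entry.isDirectory()) {
                    toVisit.push(entry);              // visit later
                } else {
                    result.add(entry);
                }
            }
        }
        return result;
    }
}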
The file date approach might not be the best. For example if you restore a file from backup. Perhaps during the indexing you could store a MD5 hash of the file contents. However you might need to do some performance benchmarking to see if the performance is acceptable
I have heard that this task is very hard to do efficiently. I'm sure MS would have implemented a similar tool in Windows if it were easy, especially nowadays since hard drives keep growing and growing.
How about something like this:
import java.io.IOException;
import java.io.InputStream;

private static String execute( String command ) throws IOException {
    // Run the command through the Windows shell and capture its standard output.
    Process p = Runtime.getRuntime().exec( "cmd /c " + command );
    InputStream i = p.getInputStream();
    StringBuilder sb = new StringBuilder();
    // Read until end of stream, one character at a time (see the note below).
    for( int c = 0 ; ( c = i.read() ) > -1 ; ) {
        sb.append( ( char ) c );
    }
    i.close();
    return sb.toString();
}
(There is a lot of room for improvement there, since that version reads one char at a time; you can pick a better version from here to read the stream faster.)
And you would use this as the argument:
"dir /b /s M:\tests\"
If this is going to be used in a running app (rather than being a standalone app) you can discount the "warm up" time of the JVM, which is about 1-2 seconds depending on your hardware.
You could give it a try to see what's the impact.
Try using git. Version control software is geared towards this kind of problem, and git has a good reputation for speed; it's specifically designed for working fast with local files. 'git diff --name-status' would get you what you want I think.
I haven't checked the implementation or the performance, but commons-io has a listFiles() method. It might be worth a try.
Related
I'm writing a backup program because Windows refuses to let me use its backup program for some reason. (I get an error code that I can report if need be.)
I only want to copy a source node file that is NEWER than a destination node file with the same name.
I found that, even though the last modified date in Windows Properties for two files showed to be identical, the source was almost invariably being copied--even though it's NOT newer.
Here are declarations:
File from_file = new File(from_name);
File to_file = new File(to_name);
Here's what I finally found for two files in different folders with the same name.
The last 3 digits returned by .lastModified() may be NONzero for one file and ZERO for the other even though the dates shown in the Properties windows for the files appear identical.
My question is WHY would that be the case???
After much frustration and debugging, I have a workaround:
destinationIsOlder = ( (long)(from_file.lastModified()/1000)*1000
                     > (long)( to_file.lastModified()/1000)*1000 );
But WHY do I have to do that? I.e., what is Windows doing? Should it do this? Is it a bug?
And what other similar evil awaits me?
I.e., should I divide by a larger integer than 1000?
(It's not the end of the world to copy a file that's technically and incorrectly reported to be a few milliseconds newer, but it's a lot of wear and tear on the drive if it happens for every single file in the source folder and subfolders!)
(I may have just stumbled onto why xcopy didn't do what I wanted, either.)
EDIT The times returned by the two calls shown above were
1419714384951 from from_file.lastModified() and
1419714384000 from to_file.lastModified(). Therefore, although the displayed dates and times are identical, from_file is technically newer and is thus, by rule, copied, but inappropriately.
lastModified returns a long with millisecond precision - therefore the last 3 digits represent the fraction of a second.
Since the file properties dialog only displays time up to the second, the two files will show the same value.
Why are some zero and some non-zero? Lots of reasons. If the file is copied from somewhere else with only second precision, it will be zero. If an application explicitly changes the file modification time, it might only do it with second resolution. And so on.
In the end, I don't think it should affect your backup scheme enough for you to worry about it.
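If you keep the workaround, a small helper makes the intent explicit. This is just the division trick from the question, truncating both timestamps to whole seconds before comparing:

/** True if 'from' is strictly newer than 'to' when both timestamps are truncated to seconds. */
static boolean isNewerIgnoringMillis(java.io.File from, java.io.File to) {
    long fromSeconds = from.lastModified() / 1000;  // drop the millisecond part
    long toSeconds   = to.lastModified() / 1000;
    return fromSeconds > toSeconds;
}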
I am running a program that I've written in Java in Eclipse. The program has a very deep level of recursion for very large inputs. For smaller inputs the program runs fine however when large inputs are given, I get the following error:
Exception in thread "main" java.lang.StackOverflowError
Can this be solved by increasing the Java stack size and if so, how do I do this in Eclipse?
Update:
@Jon Skeet
The code is traversing a parse tree recursively in order to build up a data structure. So, for example, the code will do some work using a node in the parse tree and call itself on the node's two children, combining their results to give the overall result for the tree.
The total depth of the recursion depends on the size of the parse tree but the code seems to fail (without a larger stack) when the number of recursive calls gets into the 1000s.
Also I'm pretty sure the code isn't failing because of a bug as it works for small inputs.
Open the Run Configuration for your application (Run/Run Configurations..., then look for your application's entry under 'Java Application').
The Arguments tab has a 'VM arguments' text box; enter -Xss1m (or a bigger value for the maximum stack size). The default value is 512 kB (Sun JDK 1.5 - I don't know if it varies between vendors and versions).
It may be curable by increasing the stack size - but a better solution would be to work out how to avoid recursing so much. A recursive solution can always be converted to an iterative solution - which will make your code scale to larger inputs much more cleanly. Otherwise you'll really be guessing at how much stack to provide, which may not even be obvious from the input.
Are you absolutely sure it's failing due to the size of the input rather than a bug in the code, by the way? Just how deep is this recursion?
EDIT: Okay, having seen the update, I would personally try to rewrite it to avoid using recursion. Generally, having a Stack<T> of "things still to do" is a good starting point to remove recursion.
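For illustration, here is a hedged sketch of that pattern on a hypothetical binary parse-tree node. It only counts nodes; combining the children's results bottom-up needs a little more bookkeeping (e.g. a second stack), but the key point is that the "to do" stack lives on the heap instead of the call stack:

import java.util.ArrayDeque;
import java.util.Deque;

class Node {
    Node left, right;   // children of the parse-tree node (hypothetical structure)
}

public class IterativeTraversal {
    /** Counts nodes without recursion: the explicit stack replaces the call stack. */
    static int countNodes(Node root) {
        int count = 0;
        Deque<Node> todo = new ArrayDeque<Node>();
        if (root != null) todo.push(root);
        while (!todo.isEmpty()) {
            Node current = todo.pop();
            count++;
            if (current.left != null) todo.push(current.left);
            if (current.right != null) todo.push(current.right);
        }
        return count;
    }
}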
Add the flag -Xss1024k in the VM Arguments.
You can also specify the stack size in megabytes, by using -Xss1m for example.
I also had the same problem while parsing schema definition (XSD) files using the XSOM library.
I was able to increase the stack memory up to 208 MB, but then it showed a heap out-of-memory error, for which I was able to increase the heap only up to 320 MB.
The final configuration was -Xmx320m -Xss208m, but then again it ran for some time and failed.
My function prints the entire tree of the schema definition recursively; amazingly, the output file crossed 820 MB for a definition file of 4 MB (the AIXM library), which in turn uses 50 MB of schema definition library (ISO GML).
With that I am convinced I have to avoid recursion and switch to iteration and some other way of representing the output, but I am having a little trouble converting all that recursion to iteration.
When the argument -Xss doesn't do the job, try deleting the temporary files from:
c:\Users\{user}\AppData\Local\Temp\.
This did the trick for me.
You need to have a launch configuration inside Eclipse in order to adjust the JVM parameters.
After running your program with either F11 or Ctrl-F11, open the launch configurations in Run -> Run Configurations... and open your program under "Java Applications". Select the Arguments pane, where you will find "VM arguments".
This is where -Xss1024k goes.
If you want the launch configuration to be a file in your workspace (so you can right click and run it), select the Common pane, and check the Save as -> Shared File checkbox and browse to the location you want the launch file. I usually have them in a separate folder, as we check them into CVS.
Look at Morris in-order tree traversal, which uses constant space and runs in O(n) (up to 3 times longer than your normal recursive traversal - but you save hugely on space). If the nodes are modifiable, then you could save the calculated result of the sub-tree as you backtrack to its root (by writing directly to the Node).
When using JBoss Server, double-click on the server:
Go to "Open Launch Configuration"
Then change min and max memory sizes (like 1G, 1m):
Points:
We process thousands of flat files in a day, concurrently.
Memory constraint is a major issue.
We use a thread for each file processed.
We don't sort by columns. Each line (record) in the file is treated as one column.
Can't Do:
We cannot use unix/linux's sort commands.
We cannot use any database system no matter how light they can be.
Now, we cannot just load everything into a collection and use the sort mechanism. It will eat up all the memory and the program will get a heap error.
In that situation, how would you sort the records/lines in a file?
It looks like what you are looking for is external sorting.
Basically, you sort small chunks of data first, write them back to disk, and then iterate over those chunks to sort the whole thing.
As others mentioned, you can process in steps.
I would like to explain this in my own words (it differs on point 3):
Read the file sequentially, process N records at a time in memory (N is arbitrary, depending on your memory constraint and the number T of temporary files that you want).
Sort the N records in memory, write them to a temp file. Loop on T until you are done.
Open all the T temp files at the same time, but read only one record per file (with buffers, of course). For each of these T records, find the smallest, write it to the final file, and advance only in that file.
Advantages:
The memory consumption is as low as you want.
You only do double the disk accesses compared to an everything-in-memory policy. Not bad! :-)
Example with numbers:
Original file with 1 million records.
Choose to have 100 temp files, so read and sort 10 000 records at a time, and drop these in their own temp file.
Open the 100 temp files at the same time, and read the first record of each into memory.
Compare those first records, write the smallest to the output, and advance that temp file.
Repeat the previous step one million times (a sketch of the whole scheme follows below).
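Here is a hedged sketch of this scheme in Java (file names, chunk size and error handling are simplified); the priority queue does the "find the smallest and advance only that file" step:

import java.io.*;
import java.util.*;

public class ExternalSort {

    public static void sort(File input, File output, int linesPerChunk) throws IOException {
        List<File> chunks = splitIntoSortedChunks(input, linesPerChunk);
        mergeChunks(chunks, output);
    }

    /** Steps 1-2: read N lines at a time, sort them in memory, write each batch to a temp file. */
    private static List<File> splitIntoSortedChunks(File input, int linesPerChunk) throws IOException {
        List<File> chunks = new ArrayList<File>();
        BufferedReader reader = new BufferedReader(new FileReader(input));
        try {
            List<String> buffer = new ArrayList<String>(linesPerChunk);
            String line;
            while ((line = reader.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() == linesPerChunk) {
                    chunks.add(writeSortedChunk(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) chunks.add(writeSortedChunk(buffer));
        } finally {
            reader.close();
        }
        return chunks;
    }

    private static File writeSortedChunk(List<String> lines) throws IOException {
        Collections.sort(lines);
        File chunk = File.createTempFile("sort-chunk", ".txt");
        PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(chunk)));
        try {
            for (String line : lines) out.println(line);
        } finally {
            out.close();
        }
        return chunk;
    }

    /** Step 3: open every chunk, repeatedly take the smallest current line and advance that reader. */
    private static void mergeChunks(List<File> chunks, File output) throws IOException {
        PriorityQueue<ChunkReader> queue = new PriorityQueue<ChunkReader>();
        for (File chunk : chunks) {
            ChunkReader cr = new ChunkReader(chunk);
            if (cr.current != null) queue.add(cr);
        }
        PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(output)));
        try {
            while (!queue.isEmpty()) {
                ChunkReader smallest = queue.poll();
                out.println(smallest.current);
                if (smallest.advance()) queue.add(smallest);   // re-insert if more lines remain
                else smallest.close();
            }
        } finally {
            out.close();
        }
    }

    private static class ChunkReader implements Comparable<ChunkReader> {
        final BufferedReader reader;
        String current;

        ChunkReader(File file) throws IOException {
            reader = new BufferedReader(new FileReader(file));
            current = reader.readLine();
        }
        boolean advance() throws IOException {
            current = reader.readLine();
            return current != null;
        }
        void close() throws IOException { reader.close(); }
        public int compareTo(ChunkReader other) { return current.compareTo(other.current); }
    }
}

For the numbers above you would call something like ExternalSort.sort(new File("input.txt"), new File("sorted.txt"), 10000), where the file names are of course placeholders.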
EDITED
You mentioned a multi-threaded application, so I wonder ...
As we have seen from the discussion of this requirement, using less memory costs performance, by a dramatic factor in this case. So I would also suggest using only one thread to process one sort at a time, rather than running it as a multi-threaded application.
If you process ten threads, each with a tenth of the memory available, your performance will be miserable, much less than a tenth of the initial time. If you use only one thread, queue the 9 other demands and process them in turn, your overall performance will be much better and you will finish the ten tasks much faster.
After reading this response :
Sort a file with huge volume of data given memory constraint
I suggest you consider this distribution sort. It could be a huge gain in your context.
The improvement over my proposal is that you don't need to open all the temp files at once, you only open one of them. It saves your day! :-)
You can read the files in smaller parts, sort these and write them to temporary files. Then you read two of them sequentially again and merge them into a bigger temporary file, and so on. If there is only one left, you have your sorted file. Basically that's the merge-sort algorithm performed on external files. It scales quite well to arbitrarily large files but causes some extra file I/O.
Edit: If you have some knowledge about the likely variance of the lines in your files, you can employ a more efficient algorithm (distribution sort). Simplified: you would read the original file once and write each line to a temporary file that takes only lines with the same first char (or a certain range of first chars). Then you iterate over all the (now small) temporary files in ascending order, sort them in memory and append them directly to the output file. If a temporary file turns out to be too big for sorting in memory, you can repeat the same process for it based on the 2nd char of the lines, and so on. So if your first partitioning was good enough to produce small enough files, you will have only 100% I/O overhead regardless of how large the file is, but in the worst case it can become much more than with the performance-wise stable merge sort.
In spite of your restriction, I would use the embedded database SQLite3. Like you, I work weekly with 10-15 million flat-file lines and it is very, very fast to import and generate sorted data, and you only need a small free-of-charge executable (sqlite3.exe). For example: once you download the .exe file, in a command prompt you can do this:
C:> sqlite3.exe dbLines.db
sqlite> create table tabLines(line varchar(5000));
sqlite> create index idx1 on tabLines(line);
sqlite> .separator '\r\n'
sqlite> .import 'FileToImport' TabLines
then:
sqlite> select * from tabLines order by line;
or save to a file:
sqlite> .output out.txt
sqlite> select * from tabLines order by line;
sqlite> .output stdout
I would spin up an EC2 cluster and run Hadoop's MergeSort.
Edit: not sure how much detail you would like, or on what. EC2 is Amazon's Elastic Compute Cloud - it lets you rent virtual servers by the hour at low cost. Here is their website.
Hadoop is an open-source MapReduce framework designed for parallel processing of large data sets. A job is a good candidate for MapReduce when it can be split into subsets that can be processed individually and then merged together, usually by sorting on keys (i.e. the divide-and-conquer strategy). Here is its website.
As mentioned by the other posters, external sorting is also a good strategy. I think the way I would decide between the two depends on the size of the data and speed requirements. A single machine is likely going to be limited to processing a single file at a time (since you will be using up available memory). So look into something like EC2 only if you need to process files faster than that.
You could use the following divide-and-conquer strategy:
Create a function H() that can assign each record in the input file a number. For a record r2 that will be sorted behind a record r1 it must return a larger number for r2 than for r1. Use this function to partition all the records into separate files that will fit into memory so you can sort them. Once you have done that you can just concatenate the sorted files to get one large sorted file.
Suppose you have this input file where each line represents a record
Alan Smith
Jon Doe
Bill Murray
Johnny Cash
Let's just build H() so that it uses the first letter of the record; you might get up to 26 files, but in this example you will just get 3:
<file1>
Alan Smith
<file2>
Bill Murray
<file10>
Jon Doe
Johnny Cash
Now you can sort each individual file, which would swap "Jon Doe" and "Johnny Cash" in <file10>. Then, if you just concatenate the 3 files, you'll have a sorted version of the input.
Note that you divide first and only conquer (sort) later. However, you make sure to do the partitioning in such a way that the resulting parts you need to sort don't overlap, which makes merging the result much simpler.
The method by which you implement the partitioning function H() depends very much on the nature of your input data. Once you have that part figured out the rest should be a breeze.
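Here is a hedged sketch of this approach, using the first character of each line as H(); the bucket handling is simplified and assumes each bucket fits in memory:

import java.io.*;
import java.util.*;

public class DistributionSort {
    /** Partition by first character, sort each bucket in memory, concatenate the sorted buckets. */
    public static void sort(File input, File output) throws IOException {
        Map<Character, PrintWriter> buckets = new TreeMap<Character, PrintWriter>();
        Map<Character, File> bucketFiles = new TreeMap<Character, File>();

        // Divide: one bucket file per leading character
        BufferedReader in = new BufferedReader(new FileReader(input));
        String line;
        while ((line = in.readLine()) != null) {
            char key = line.isEmpty() ? ' ' : line.charAt(0);
            PrintWriter w = buckets.get(key);
            if (w == null) {
                File f = File.createTempFile("bucket-" + (int) key + "-", ".txt");
                w = new PrintWriter(new BufferedWriter(new FileWriter(f)));
                buckets.put(key, w);
                bucketFiles.put(key, f);
            }
            w.println(line);
        }
        in.close();
        for (PrintWriter w : buckets.values()) w.close();

        // Conquer: sort each (small) bucket in memory, append to the output in key order
        PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(output)));
        for (File f : bucketFiles.values()) {
            List<String> lines = new ArrayList<String>();
            BufferedReader r = new BufferedReader(new FileReader(f));
            String l;
            while ((l = r.readLine()) != null) lines.add(l);
            r.close();
            Collections.sort(lines);
            for (String s : lines) out.println(s);
        }
        out.close();
    }
}

If a bucket is still too large to sort in memory, the same partitioning can be applied to it recursively on the second character, as another answer here suggests.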
If your restriction is only to not use an external database system, you could try an embedded database (e.g. Apache Derby). That way, you get all the advantages of a database without any external infrastructure dependencies.
Here is a way to do it without heavy use of sorting inside Java and without using a DB.
Assumptions: you have 1 TB of space and the files contain, or start with, a unique number, but are unsorted.
Divide the files N times.
Read those N files one by one, and create one file for each line/number
Name each file with the corresponding number. While naming, keep a counter updated to store the lowest number.
Now you can already have the root folder of files sorted by name, or pause your program to give yourself time to fire a command on your OS to sort the files by name. You can do it programmatically too.
Now you have a folder with files sorted by name; using the counter, start taking each file one by one, put its number in your OUTPUT file, and close it.
When you are done you will have a large file with sorted numbers.
I know you mentioned not using a database no matter how light... so, maybe this is not an option. But, what about hsqldb in memory... submit it, sort it by query, purge it. Just a thought.
You can use a SQLite file DB, load the data into the DB and then let it sort and return the results for you.
Advantages: No need to worry about writing the best sorting algorithm.
Disadvantage: You will need disk space, slower processing.
https://sites.google.com/site/arjunwebworld/Home/programming/sorting-large-data-files
You can do it with only two temp files - source and destination - and as little memory as you want.
In the first step your source is the original file; in the last step the destination is the result file.
On each iteration:
read from the source file into a sliding buffer a chunk of data half the size of the buffer;
sort the whole buffer
write to the destination file the first half of the buffer.
shift the second half of the buffer to the beginning and repeat
Keep a boolean flag that says whether you had to move some records in the current iteration.
If the flag remains false, your file is sorted.
If it's raised, repeat the process using the destination file as a source.
Max number of iterations: (file size)/(buffer size)*2
You could download gnu sort for windows: http://gnuwin32.sourceforge.net/packages/coreutils.htm Even if that uses too much memory, it can merge smaller sorted files as well. It automatically uses temp files.
There's also the sort that comes with windows within cmd.exe. Both of these commands can specify the character column to sort by.
File sort software for big files: https://github.com/lianzhoutw/filesort/ .
It is based on the file merge sort algorithm.
If you can move forward/backward in a file (seek), and rewrite parts of the file, then you should use bubble sort.
You will have to scan lines in the file, and only need to keep 2 rows in memory at a time, swapping them if they are not in the right order. Repeat the process until there are no more lines to swap.
I have a set of source folders. I use a Java class to build the distribution file out of these folders. I'd like to write another little class in Java which runs every half a second, checks if any of the files in the folders have changed, and if yes, run the building class.
So, how do I easily detect that a folder has been modified?
I think you will need to check the directory and subdirectory modification times (for files being added/removed) and the file modification times (for changes in each file).
Write a recursive routine that checks a directory's modification time (and each file's, if it changed), then checks the directory contents and calls itself recursively for any subdirectories. You should just be able to check for any modification times greater than when you last ran the check.
See File.lastModified()
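A minimal sketch of such a routine; it only answers "did anything under this directory change since the given time?":

import java.io.File;

public class ChangeDetector {
    /** True if the directory, any file in it, or any subdirectory tree was modified after 'since'. */
    public static boolean changedSince(File dir, long since) {
        if (dir.lastModified() > since) return true;        // entries added or removed
        File[] entries = dir.listFiles();
        if (entries == null) return false;
        for (File entry : entries) {
            if (entry.isDirectory()) {
                if (changedSince(entry, since)) return true;
            } else if (entry.lastModified() > since) {       // file content changed
                return true;
            }
        }
        return false;
    }
}

You would remember System.currentTimeMillis() when you build the distribution file, and call changedSince(sourceFolder, lastBuildTime) every half second.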
EDIT: Since I wrote the above, Java 7 came out with its directory watching capability.
Here is a list of possible solutions and an example of a simple File/Folder Watcher.
If you are allowed to use Java 7, it has support for platform-independent directory/file change notifications.
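If Java 7 is available, a basic WatchService loop looks roughly like this (watching a single directory; watching a whole tree means registering each subdirectory as well, and the path below is just a placeholder):

import java.nio.file.*;

public class WatchExample {
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get("/path/to/watch");             // hypothetical directory
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_DELETE,
                StandardWatchEventKinds.ENTRY_MODIFY);
        while (true) {
            WatchKey key = watcher.take();                  // blocks until something happens
            for (WatchEvent<?> event : key.pollEvents()) {
                System.out.println(event.kind() + ": " + event.context());
            }
            if (!key.reset()) break;                        // directory no longer accessible
        }
    }
}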
JNA has a sample for cross platform change notification here. Not sure how easy you might find it.
I don't know if it's any good, but here's one person's take on the problem.
Sounds like .NET has something built-in: FileSystemWatcher
UPDATE: Thanks to kd304, I just learned that Java 7 will have the same feature. Won't do you much good today, unless you can use the preview release.
You need to watch each file and keep track of the File.lastModified attribute and check the File.exists flag together with a bit of simple recursion to walk the directory structure.
With NIO.2 (Java 7) it will be very easy. With Java 6 you could call list() and compare with the previous list once a second (a poor man's watch service).
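The poor man's version could be as simple as this sketch (it only notices additions and removals in a single directory, assumes the directory exists and is readable, and the path is a placeholder):

import java.io.File;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class PollingWatcher {
    public static void main(String[] args) throws InterruptedException {
        File dir = new File("/path/to/watch");               // hypothetical directory
        Set<String> previous = new HashSet<String>(Arrays.asList(dir.list()));
        while (true) {
            Thread.sleep(1000);
            Set<String> current = new HashSet<String>(Arrays.asList(dir.list()));
            if (!current.equals(previous)) {
                System.out.println("directory contents changed");
                previous = current;                           // rebuild here
            }
        }
    }
}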
Yes, there are a number of available listeners for directories, but they're all relatively complicated and most involve threads.
A few days ago I ended up in an almost heated discussion with one of our engineers over whether it was permissible to create a new thread (in a web application) simply to monitor a directory tree. In the end I agreed with him, but by virtue of coming up with something so fast that having a listener is unnecessary. Note: the solution described below only works if you don't need to know which file has changed, only that a file has changed.
You provide the following method with a Collection of Files (e.g., obtained via Apache Commons IO's FileUtils.listFiles() method) and it returns a hash for the collection. If any file is added, deleted, or has its modification date changed, the hash will change.
In my tests, 50K files takes about 750 ms on a 3 GHz Linux box. Touching any of the files alters the hash. In my own implementation I'm using a different hash algorithm (DJB) that's a bit faster, but that's the gist of it. We now just store the hash and check it each time, as it's pretty painless, especially for smaller file collections. If anything changes we then re-index the directory. The complexity of a watcher just wasn't worth it in our application.
import java.io.File;
import java.util.Collection;
import java.util.zip.Adler32;

/**
 * Provided a Collection of Files, returns
 * a hash using the Adler hash algorithm.
 *
 * @param files the Collection of Files to hash.
 * @return a hash of the Collection.
 */
public static long getHash( Collection<File> files )
{
    Adler32 adler = new Adler32();
    StringBuilder sb = new StringBuilder();
    for ( File f : files ) {
        // one entry per file: full path plus its modification time
        String s = f.getParent() + '/' + f.getName() + ':' + String.valueOf( f.lastModified() );
        adler.reset();
        adler.update( s.getBytes() );
        sb.append( adler.getValue() ).append( ' ' );
    }
    adler.reset();
    adler.update( sb.toString().getBytes() );
    return adler.getValue();
}
And yes, there's room for improvement (e.g., we use a hash method rather than inlining it). The above is cut down from our actual code but should give you a good idea what we did.
I'm looking for something that will monitor Windows directories for size and file count over time. I'm talking about a handful of servers and a few thousand folders (millions of files).
Requirements:
Notification on X increase in size over Y time
Notification on X increase in file count over Y time
Historical graphing (or at least saving snapshot data over time) of size and file count
All of this on a set of directories and their child directories
I'd prefer a free solution but would also appreciate getting pointed in the right direction. If we were to write our own, how would we go about doing that? Available languages being Ruby, Groovy, Java, Perl, or PowerShell (since I'd be writing it).
There are several solutions out there, including some free ones. Some that I have worked with include:
Nagios
and
Big Brother
A quick google search can probably find more.
You might want to take a look at PolyMon, which is an open source systems monitoring solution. It allows you to write custom monitors in any .NET language, and allows you to create custom PowerShell monitors.
It stores data on a SQL Server back end and provides graphing. For your purpose, you would just need a script that would get the directory size and file count.
Something like:
$size = 0
$count = 0
$path = '\\unc\path\to\directory\to\monitor'
get-childitem -path $path -recurse | Where-Object {$_ -is [System.IO.FileInfo]} | ForEach-Object {$size += $_.length; $count += 1}
In reply to Scott's comment:
Sure, you could wrap it in a while loop:
$ESCkey = 27
Write-Host "Press the ESC key to stop sniffing" -foregroundcolor "CYAN"
$Running = $true
While ($Running)
{
    if ($host.ui.RawUI.KeyAvailable) {
        $key = $host.ui.RawUI.ReadKey("NoEcho,IncludeKeyUp,IncludeKeyDown")
        if ($key.VirtualKeyCode -eq $ESCkey) {
            $Running = $False
        }
    }
    #rest of function here
}
I would not do that for a PowerShell monitor, which you can schedule to run periodically, but for a script that runs in the background, the above would work. You could even add some database access code to log the results to a database, or log them to a file... whatever you want.
You can certainly accomplish this with PowerShell and WMI. You would need some sort of DB backend like SQL Express. But I agree that a tool like PolyMon is a better approach. The one thing that might make a difference is the issue of scale. Do you need to monitor 1 folder on 1 server, or hundreds?
http://sourceforge.net/projects/dirviewer/ -- DirViewer is a light, pure Java application for directory tree viewing and recursive disk usage statistics, using the JGoodies Looks look and feel, similar to Windows XP.