I'm looking for something that will monitor Windows directories for size and file count over time. I'm talking about a handful of servers and a few thousand folders (millions of files).
Requirements:
Notification on X increase in size over Y time
Notification on X increase in file count over Y time
Historical graphing (or at least saving snapshot data over time) of size and file count
All of this on a set of directories and their child directories
I'd prefer a free solution but would also appreciate getting pointed in the right direction. If we were to write our own, how would we go about doing that? Available languages being Ruby, Groovy, Java, Perl, or PowerShell (since I'd be writing it).
There are several solutions out there, including some free ones. Some that I have worked with include:
Nagios
Big Brother
A quick Google search can probably find more.
You might want to take a look at PolyMon, which is an open source systems monitoring solution. It allows you to write custom monitors in any .NET language, and allows you to create custom PowerShell monitors.
It stores data on a SQL Server back end and provides graphing. For your purpose, you would just need a script that would get the directory size and file count.
Something like:
# Tally the total size and file count under one directory tree
$size = 0
$count = 0
$path = '\\unc\path\to\directory\to\monitor'
Get-ChildItem -Path $path -Recurse |
    Where-Object { $_ -is [System.IO.FileInfo] } |
    ForEach-Object { $size += $_.Length; $count += 1 }
In reply to Scott's comment:
Sure, you could wrap it in a while loop:
$ESCkey = 27
Write-Host "Press the ESC key to stop sniffing" -ForegroundColor "CYAN"
$Running = $true
While ($Running)
{
    if ($host.UI.RawUI.KeyAvailable) {
        $key = $host.UI.RawUI.ReadKey("NoEcho,IncludeKeyUp,IncludeKeyDown")
        if ($key.VirtualKeyCode -eq $ESCkey) {
            $Running = $False
        }
    }
    # rest of function here
}
I would not do that for a PolyMon PowerShell monitor, which you can schedule to run periodically, but for a script that runs in the background the above would work. You could even add some database access code to log the results to a database, or log them to a file; whatever you want.
You can certainly accomplish this with PowerShell and WMI. You would need some sort of DB backend like SQL Express. But I agree that a tool like PolyMon is a better approach. The one thing that might make a difference is the issue of scale. Do you need to monitor 1 folder on 1 server, or hundreds?
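If you do end up writing your own (Java is on your list of languages), the core of it is just a periodic snapshot pass over each watched tree. Here is a rough sketch, with hypothetical class and method names, using NIO's walkFileTree; persist the timestamped totals and compare runs to implement the "X increase over Y time" checks:

import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: walks one directory tree and totals file count and size.
// Store the two numbers with a timestamp and compare runs to trigger alerts.
public class DirectorySnapshot {
    public static long[] snapshot(Path root) throws IOException {
        final AtomicLong files = new AtomicLong();
        final AtomicLong bytes = new AtomicLong();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                files.incrementAndGet();
                bytes.addAndGet(attrs.size());
                return FileVisitResult.CONTINUE;
            }
            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                return FileVisitResult.CONTINUE; // skip unreadable files
            }
        });
        return new long[] { files.get(), bytes.get() };
    }

    public static void main(String[] args) throws IOException {
        long[] s = snapshot(Paths.get(args[0]));
        // timestamp, path, file count, total bytes
        System.out.println(System.currentTimeMillis() + "," + args[0] + "," + s[0] + "," + s[1]);
    }
}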
http://sourceforge.net/projects/dirviewer/ -- DirViewer is a light, pure-Java application for directory tree view and recursive disk usage statistics, using a JGoodies-Looks look and feel similar to Windows XP.
I know it is possible to get system information in Java and have searched SO for this particular question but have come up empty.
Question:
Can I gather complete system information about all attached monitors? Particularly I am hoping to get a unique ID, model number, or manufacturer of each monitor.
Wayan Saryada for example demonstrated a simple example here that gets basic information about all connected monitors. Code follows to protect from link rot:
//
// Get local graphics environment
//
GraphicsEnvironment env = GraphicsEnvironment.getLocalGraphicsEnvironment();
GraphicsDevice[] devices = env.getScreenDevices();
int sequence = 1;
for (GraphicsDevice device : devices) {
    System.out.println("Screen Number [" + (sequence++) + "]");
    System.out.println("Width       : " + device.getDisplayMode().getWidth());
    System.out.println("Height      : " + device.getDisplayMode().getHeight());
    System.out.println("Refresh Rate: " + device.getDisplayMode().getRefreshRate());
    System.out.println("Bit Depth   : " + device.getDisplayMode().getBitDepth());
    System.out.println("");
}
This code, however, produces nothing unique. It's just generic information that is helpful but does not meet my end goal.
Use Case / End Goal:
My intention is to make a portable application that detects which computer/computer setup you are using and displays the appropriate desktop. One example: on my work computer with a large monitor it lays out my desktop one way, and on my laptop it hides some things; long story short, it would show an alternate desktop display (fewer icons, only the ones I need for work/home).
I have all that extra stuff worked out but I need a way to track not just what computer I am on but which monitors I have attached at the time. Essentially a "MAC address" for attached monitors; I know that is not a thing. This way on a triple monitor setup for example my application knows what goes where. Then if I remove one monitor it knows what to change to.
So, long story short: there just is no way to accurately detect in Java which monitor is which using the GraphicsDevice API. There are several other ways to detect this information, but again Java runs into the same problem. Java just auto-increments the monitors it finds, like so:
Display0,
Display1,
Display2
...And so on.
Each time the computer turns on, different monitor assignments can happen, and each time a monitor turns off, the order can change. The same is true if the graphics driver crashes or restarts while in use.
One Solution
This does not solve my original issue (the question above). However, I know it is extremely hard to get full, human-readable system information with Java, especially for all attached monitors. I put in a request to dbwiddis on GitHub and he was able to add this feature to OSHI. His Java library pulls everything you need to know about every monitor attached to a computer. A perfect all-in-one solution for network admins.
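For example, a rough sketch of reading each monitor's EDID block through OSHI (method names are from memory and may differ between OSHI versions):

import oshi.SystemInfo;
import oshi.hardware.Display;
import oshi.util.EdidUtil;

// Rough sketch using OSHI to read each attached monitor's EDID block.
// Method and class names may vary between OSHI versions.
public class MonitorIds {
    public static void main(String[] args) {
        SystemInfo si = new SystemInfo();
        for (Display display : si.getHardware().getDisplays()) {
            byte[] edid = display.getEdid();
            // Manufacturer + product + serial is about as close to a
            // "MAC address for monitors" as EDID gets.
            System.out.println("Manufacturer: " + EdidUtil.getManufacturerID(edid));
            System.out.println("Product     : " + EdidUtil.getProductID(edid));
            System.out.println("Serial      : " + EdidUtil.getSerialNo(edid));
        }
    }
}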
The Issue Still / Hack
There still is no way (including using other APIs accessible through other programming languages) to detect which monitor is which: the "MAC address" of all your attached monitors, if you will. The only way to do this reliably (really a hack) is to pop up a terminal on each display that outputs information you can then grab with Java. That way Java can manually make the matches: it can go out to Display2, read the terminal window that is open on it, and now we know which monitor is Display2.
I have no code for this right now because it is extremely unreliable and not guaranteed to be cross-OS compatible. It should work fine in Windows, but you will need to execute a batch file or PowerShell script to get the terminals onto the monitors you want.
Just look at the GraphicsDevice API:
https://docs.oracle.com/javase/7/docs/api/java/awt/GraphicsDevice.html#getIDstring%28%29
I need to run a k-medoids clustering algorithm by using ELKI programmatically. I have a similarity matrix that I wish to input to the algorithm.
Is there any code snippet available for how to run ELKI algorithms?
I basically need to know how to create Database and Relation objects, create a custom distance function, and read the algorithm output.
Unfortunately the ELKI tutorial (http://elki.dbs.ifi.lmu.de/wiki/Tutorial) focuses on the GUI version and on implementing new algorithms, and trying to write code by looking at the Javadoc is frustrating.
If someone is aware of any easy-to-use library for k-medoids, that's probably a good answer to this question as well.
We do appreciate documentation contributions! (Update: I have turned this post into a new ELKI tutorial entry for now.)
ELKI does advocate not embedding it in other Java applications, for a number of reasons. This is why we recommend using the MiniGUI (or the command line it constructs). Adding custom code is best done e.g. as a custom ResultHandler, or just by using the ResultWriter and parsing the resulting text files.
If you really want to embed it in your code (there are a number of situations where it is useful, in particular when you need multiple relations, and want to evaluate different index structures against each other), here is the basic setup for getting a Database and Relation:
// Setup parameters:
ListParameterization params = new ListParameterization();
params.addParameter(FileBasedDatabaseConnection.INPUT_ID, filename);
// Add other parameters for the database here!
// Instantiate the database:
Database db = ClassGenericsUtil.parameterizeOrAbort(
    StaticArrayDatabase.class,
    params);
// Don't forget this, it will load the actual data...
db.initialize();
Relation<DoubleVector> vectors = db.getRelation(TypeUtil.DOUBLE_VECTOR_FIELD);
Relation<LabelList> labels = db.getRelation(TypeUtil.LABELLIST);
If you want to program more generically, use NumberVector<?>.
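From there, an algorithm such as k-medoids can be parameterized the same way and run on the database. Here is a rough sketch that continues the snippet above (it reuses db and labels; class and option names follow the 0.6-era API and may differ in your ELKI version):

// Parameterize and run k-medoids (EM-style) on the database from above.
// KMeans.K_ID is reused for the k parameter; adjust names to your ELKI version.
ListParameterization kparams = new ListParameterization();
kparams.addParameter(KMeans.K_ID, 10);
KMedoidsEM<DoubleVector> kmedoids = ClassGenericsUtil.parameterizeOrAbort(
    KMedoidsEM.class,
    kparams);

Clustering<MedoidModel> clustering = kmedoids.run(db);
for (Cluster<MedoidModel> cluster : clustering.getAllClusters()) {
    System.out.println("Cluster size: " + cluster.size());
    // Use the LabelList relation from above to print readable object labels:
    for (DBIDIter it = cluster.getIDs().iter(); it.valid(); it.advance()) {
        System.out.println("  " + labels.get(it));
    }
}

A custom distance (or similarity) function can be supplied the same way, via the algorithm's distance function parameter, instead of the default Euclidean distance.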
Why we do (currently) not recommend using ELKI as a "library":
The API is still changing a lot. We keep on adding options, and we cannot (yet) provide a stable API. The command line / MiniGUI / parameterization is much more stable, because of the handling of default values: the parameterization only lists the non-default parameters, so you will only notice a change if one of those changes.
In the code example above, note that I also used this pattern. A change to the parsers, database etc. will likely not affect this program!
Memory usage: data mining is quite memory intensive. If you use the MiniGUI or command line, you have a good cleanup when the task is finished. If you invoke it from Java, chances are really high that you keep some reference somewhere and end up leaking lots of memory. So do not use the above pattern without ensuring that the objects are properly cleaned up when you are done!
By running ELKI from the command line, you get two things for free:
no memory leaks. When the task is finished, the process quits and frees all memory.
no need to rerun it twice for the same data. Subsequent analysis does not need to rerun the algorithm.
ELKI is not designed as an embeddable library, for good reasons. ELKI has tons of options and functionality, and this comes at a price in runtime (although it can easily outperform R and Weka, for example!), memory usage, and in particular code complexity.
ELKI was designed for research in data mining algorithms, not for making them easy to include in arbitrary applications. Instead, if you have a particular problem, you should use ELKI to find out which approach works well, then reimplement that approach in an optimized manner for your problem.
Best ways of using ELKI
Here are some tips and tricks:
Use the MiniGUI to build a command line. Note that in the logging window of the "GUI" it shows the corresponding command line parameters - running ELKI from command line is easy to script, and can easily be distributed to multiple computers e.g. via Grid Engine.
#!/bin/bash
for k in $( seq 3 39 ); do
java -jar elki.jar KDDCLIApplication \
-dbc.in whatever \
-algorithm clustering.kmeans.KMedoidsEM \
-kmeans.k $k \
-resulthandler ResultWriter -out.gzip \
-out output/k-$k
done
Use indexes. For many algorithms, index structures can make a huge difference!
(But you need to do some research which indexes can be used for which algorithms!)
Consider using the extension points such as ResultWriter. It may be the easiest for you to hook into this API, then use ResultUtil to select the results that you want to output in your own preferred format or analyze:
List<Clustering<? extends Model>> clusterresults =
ResultUtil.getClusteringResults(result);
To identify objects, use labels and a LabelList relation. The default parser will do this when it sees text alongside the numerical attributes, i.e. a file such as
1.0 2.0 3.0 ObjectLabel1
will make it easy to identify the object by its label!
UPDATE: See ELKI tutorial created out of this post for updates.
ELKI's documentation is pretty sparse (I don't know why they don't include a simple "hello world" program in the examples).
You could try Java-ML. Its documentation is a bit more user friendly, and it does have K-medoid.
Clustering example with Java-ML: http://java-ml.sourceforge.net/content/clustering-basics
K-medoid: http://java-ml.sourceforge.net/api/0.1.7/
I want to run a single unit test and collect its "profiling" information: how often every method was called, how many instances of a certain class were created, how much time it took to execute a certain method/thread, etc. Then I want to compare this information with some expected values. Are there any profilers for Java that let me do this (all of this should be done automatically, without any GUI or user interaction, of course)?
This is how I want it to work:
public class MyTest {
    @Test
    public void justTwoCallsToFoo() {
        Profiler.start(Foo.class);
        Foo foo = new Foo();
        foo.someMethodToProfile(); // profiler should collect data here
        assertThat(
            Profiler.getTotalCallsMadeTo(Foo.class, "barMethod"),
            equalTo(3)
        );
    }
}
Capturing snapshots of profiling information automatically is possible with several Java profilers: you have to add VM parameters to the Java start command to load the profiler, and you have to tell the profiler to record data and save a snapshot on exit.
In JProfiler, you could use the "profile" ant task to do this. It takes the profiling settings from a config file that is exported from the JProfiler GUI. You can use the Controller API to start recording and save a snapshot.
Then you have two choices:
You can use the jpcompare command line utility or the corresponding "compare" ant task to compare the snapshot against a known baseline. There are several comparisons available. You can export the results as HTML or in a machine-readable format (CSV/XML).
You can extract data from the snapshot with the jpexport command line utility or the corresponding "export" ant task. Then you can write your own code to analyze the profiling data from a particular run.
For a limited set of profiling data, it is actually possible to use the JProfiler API to write your own profiler that does something like the Profiler.getTotalCallsMadeTo(Foo.class, "barMethod") in your example. See the api/samples/platform example in a JProfiler installation. You would get the information from the hot spot data.
However, for making assertions on whether a specific method was called, I would suggest taking an AOP-based approach. Using a profiler-based approach is appropriate for detecting performance regressions.
Disclaimer: My company develops JProfiler.
I found out that this can be done with the J2SE built-in HPROF profiling tool.
java -agentlib:hprof=cpu=times MyTest
It produces a text file formatted like this:
CPU SAMPLES BEGIN (total = 126) Fri Oct 22 12:12:14 2004
rank self accum count trace method
1 53.17% 53.17% 67 300027 java.util.zip.ZipFile.getEntry
2 17.46% 70.63% 22 300135 java.util.zip.ZipFile.getNextEntry
3 5.56% 76.19% 7 300111 java.lang.ClassLoader.defineClass2
4 3.97% 80.16% 5 300140 java.io.UnixFileSystem.list
5 2.38% 82.54% 3 300149 java.lang.Shutdown.halt0
....
Then it's easy to parse the file and extract the methods you're interested in. The solution is completely free and built into the JSE, which is a great benefit compared to other tools.
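For instance, here is a minimal sketch of pulling the sample count for one method out of the "CPU SAMPLES" section (the file name and target method below are placeholders):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Minimal sketch: scan an HPROF text dump and sum the sample counts
// for one method of interest. File name and method are placeholders.
public class HprofSamples {
    public static int countSamplesFor(String hprofFile, String method) throws IOException {
        int total = 0;
        boolean inSamples = false;
        try (BufferedReader in = new BufferedReader(new FileReader(hprofFile))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith("CPU SAMPLES BEGIN")) { inSamples = true; continue; }
                if (line.startsWith("CPU SAMPLES END")) { break; }
                if (!inSamples) continue;
                // Data columns: rank, self%, accum%, count, trace, method
                String[] cols = line.trim().split("\\s+");
                if (cols.length >= 6 && cols[5].equals(method)) {
                    total += Integer.parseInt(cols[3]);
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countSamplesFor("java.hprof.txt", "java.util.zip.ZipFile.getEntry"));
    }
}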
I have to find a memory leak in a Java application. I have some experience with this but would like advice on a methodology/strategy for this. Any reference and advice is welcome.
About our situation:
Heap dumps are larger than 1 GB
We have heap dumps from 5 occasions.
We don't have any test case to provoke this. It only happens in the (massive) system test environment, after at least a week's usage.
The system is built on an internally developed legacy framework with so many design flaws that it is impossible to count them all.
Nobody understands the framework in depth. It has been transferred to one guy in India who barely keeps up with answering e-mails.
We have done snapshot heap dumps over time and concluded that there is not a single component increasing over time. It is everything that grows slowly.
The above points us in the direction that it is the framework's homegrown ORM system that increases its usage without limits. (This system maps objects to files?! So not really an ORM.)
Question: What is the methodology that has helped you succeed in hunting down leaks in an enterprise-scale application?
It's almost impossible without some understanding of the underlying code. If you understand the underlying code, then you can better sort the wheat from the chaff of the zillion bits of information you are getting in your heap dumps.
Also, you can't know if something is a leak or not without knowing why the class is there in the first place.
I just spent the past couple of weeks doing exactly this, and I used an iterative process.
First, I found the heap profilers basically useless. They can't analyze the enormous heaps efficiently.
Rather, I relied almost solely on jmap histograms.
I imagine you're familiar with these, but for those not:
jmap -histo:live <pid> > histogram.out
creates a histogram of the live heap. In a nutshell, it tells you the class names, and how many instances of each class are in the heap.
I was dumping out heap regularly, every 5 minutes, 24hrs a day. That may well be too granular for you, but the gist is the same.
I ran several different analyses on this data.
I wrote a script to take two histograms, and dump out the difference between them. So, if java.lang.String was 10 in the first dump, and 15 in the second, my script would spit out "5 java.lang.String", telling me it went up by 5. If it had gone down, the number would be negative.
I would then take several of these differences, strip out all classes that went down from run to run, and take a union of the result. At the end, I'd have a list of classes that continually grew over a specific time span. Obviously, these are prime candidates for leaking classes.
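A rough sketch of such a histogram diff in Java (it assumes two jmap -histo:live output files as arguments; the column parsing is approximate):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Rough sketch: diff two "jmap -histo:live" outputs and print classes whose
// instance counts grew. Column parsing is approximate and may need tweaking.
public class HistogramDiff {
    static Map<String, Long> parse(String file) throws IOException {
        Map<String, Long> counts = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get(file))) {
            // Expected data lines look like: "   1:   4632416  392305928  [C"
            String[] cols = line.trim().split("\\s+");
            if (cols.length == 4 && cols[0].endsWith(":")) {
                counts.put(cols[3], Long.parseLong(cols[1]));
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        Map<String, Long> before = parse(args[0]);
        Map<String, Long> after = parse(args[1]);
        for (Map.Entry<String, Long> e : after.entrySet()) {
            long delta = e.getValue() - before.getOrDefault(e.getKey(), 0L);
            if (delta > 0) {
                System.out.println(delta + " " + e.getKey());
            }
        }
    }
}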
However, some classes have some instances preserved while others are GC'd. These classes can easily go up and down overall, yet still leak, so they can fall out of the "always rising" category of classes.
To find these, I converted the data into a time series and loaded it into a database, Postgres specifically. Postgres is handy because it offers statistical aggregate functions, so you can do simple linear regression analysis on the data and find classes that trend up, even if they aren't always at the top of the charts. I used the regr_slope function, looking for classes with a positive slope.
I found this process very successful, and really efficient. The histogram files aren't insanely large, and it was easy to download them from the hosts. They weren't super expensive to run on the production system (they do force a large GC, and may block the VM for a bit). I was running this on a system with a 2G Java heap.
Now, all this can do is identify potentially leaking classes.
This is where understanding how the classes are used, and whether they should or should not be there, comes into play.
For example, you may find that you have a lot of Map.Entry classes, or some other system class.
Unless you're simply caching String, the fact is these system classes, while perhaps the "offenders", are not the "problem". If you're caching some application class, THAT class is a better indicator of where your problem lies. If you don't cache com.app.yourbean, then you won't have the associated Map.Entry tied to it.
Once you have some classes, you can start crawling the code base looking for instances and references. Since you have your own ORM layer (for good or ill), you can at least readily look at its source code. If your ORM is caching things, it's likely caching ORM classes wrapping your application classes.
Finally, another thing you can do, is once you know the classes, you can start up a local instance of the server, with a much smaller heap and smaller dataset, and using one of the profilers against that.
In this case, you can do a unit test that affects only one (or a small number) of the things you think may be leaking. For example, you could start up the server, run a histogram, perform a single action, and run the histogram again. Your leaking class should have increased by 1 (or whatever your unit of work is).
A profiler may be able to help you track the owners of that "now leaked" class.
But, in the end, you're going to have to have some understanding of your code base to better understand what's a leak, and what's not, and why an object exists in the heap at all, much less why it may be being retained as a leak in your heap.
Take a look at Eclipse Memory Analyzer. It's a great tool (and self contained, does not require Eclipse itself installed) which 1) can open up very large heaps very fast and 2) has some pretty good automatic detection tools. The latter isn't perfect, but EMA provides a lot of really nice ways to navigate through and query the objects in the dump to find any possible leaks.
I've used it in the past to help hunt down suspicious leaks.
This answer expands upon Will Hartung's. I applied the same process to diagnose one of my memory leaks and thought that sharing the details would save other people time.
The idea is to have postgres 'plot' time vs. memory usage of each class, draw a line that summarizes the growth and identify the objects that are growing the fastest:
^
|
s | Legend:
i | * - data point
z | -- - trend
e |
( |
b | *
y | --
t | --
e | * -- *
s | --
) | *-- *
| -- *
| -- *
--------------------------------------->
time
Convert your heap dumps (you need several) into a format that is convenient for consumption by Postgres. The heap dump histogram format looks like this:
num #instances #bytes class name
----------------------------------------------
1: 4632416 392305928 [C
2: 6509258 208296256 java.util.HashMap$Node
3: 4615599 110774376 java.lang.String
5: 16856 68812488 [B
6: 278914 67329632 [Ljava.util.HashMap$Node;
7: 1297968 62302464
...
To a CSV file with the datetime of each heap dump:
2016.09.20 17:33:40,[C,4632416,392305928
2016.09.20 17:33:40,java.util.HashMap$Node,6509258,208296256
2016.09.20 17:33:40,java.lang.String,4615599,110774376
2016.09.20 17:33:40,[B,16856,68812488
...
Using this script:
#!/usr/bin/perl
# Example invocation: convert.heap.hist.to.csv.pl -f heap.2016.09.20.17.33.40.txt -dt "2016.09.20 17:33:40" >> heap.csv
use strict;
use warnings;
use Getopt::Long;

my $file;
my $dt;
GetOptions(
    "f=s"  => \$file,
    "dt=s" => \$dt
) or die "Error in command line arguments\n";

open my $fh, '<', $file or die $!;
while (not eof($fh)) {
    my $line = <$fh>;
    $line =~ s/\R//g;    # remove newlines
    #  1:   4442084  369475664  [C
    my ($instances, $size, $class) = ($line =~ /^\s*\d+:\s+(\d+)\s+(\d+)\s+([\$\[\w\.]+)\s*$/);
    if ($instances) {
        print "$dt,$class,$instances,$size\n";
    }
}
close($fh);
Create a table to put the data in
CREATE TABLE heap_histogram (
    histwhen  timestamp without time zone NOT NULL,
    class     character varying NOT NULL,
    instances integer NOT NULL,
    bytes     integer NOT NULL
);
Copy the data into your new table
\COPY heap_histogram FROM 'heap.csv' WITH DELIMITER ',' CSV ;
Run the slope query against size (number of bytes):
SELECT class, REGR_SLOPE(bytes,extract(epoch from histwhen)) as slope
FROM public.heap_histogram
GROUP BY class
HAVING REGR_SLOPE(bytes,extract(epoch from histwhen)) > 0
ORDER BY slope DESC
;
Interpret the results:
class | slope
---------------------------+----------------------
java.util.ArrayList | 71.7993806279174
java.util.HashMap | 49.0324576155785
java.lang.String | 31.7770770326123
joe.schmoe.BusinessObject | 23.2036817108056
java.lang.ThreadLocal | 20.9013528767851
The slope is bytes added per second (since the unit of epoch is in seconds). If you use instances instead of size, then that's the number of instances added per second.
One of my lines of code creating this joe.schmoe.BusinessObject was responsible for the memory leak: it was creating the object and appending it to an array without checking whether it already existed. The other objects were also created along with the BusinessObject near the leaking code.
Can you accelerate time? i.e. can you write a dummy test client that forces it to do a week's worth of calls/requests etc. in a few minutes or hours? Such a client is your biggest friend; if you don't have one, write one.
We used NetBeans a while ago to analyse heap dumps. It can be a bit slow, but it was effective. Eclipse just crashed, and the 32-bit Windows tools did as well.
If you have access to a 64bit system or a Linux system with 3GB or more you will find it easier to analyse the heap dumps.
Do you have access to change logs and incident reports? Large scale enterprises will normally have change management and incident management teams and this may be useful in tracking down when problems started happening.
When did it start going wrong? Talk to people and try and get some history. You may get someone saying, "Yeah, it was after they fixed XYZ in patch 6.43 that we got weird stuff happening".
I've had success with IBM Heap Analyzer. It offers several views of the heap, including largest drop-off in object size, most frequently occurring objects, and objects sorted by size.
There are great tools like Eclipse MAT and Heap Hero to analyze heap dumps. However, you need to provide these tools with heap dumps captured in the correct format and correct point in time.
This article gives you multiple options to capture heap dumps. However, in my opinion, the first three are the most effective options to use, and the others are good to be aware of.
1. jmap
2. HeapDumpOnOutOfMemoryError
3. jcmd
4. JVisualVM
5. JMX
6. Programmatic Approach
7. IBM Administrative Console
7 Options to capture Java Heap dumps
If it's happening after a week's usage, and your application is as byzantine as you describe, perhaps you're better off restarting it every week?
I know it's not fixing the problem, but it may be a time-effective solution. Are there time windows when you can have outages? Can you load balance and fail over one instance whilst keeping the second up? Perhaps you can trigger a restart when memory consumption breaches a certain limit (perhaps monitoring via JMX or similar).
I've used jhat. It is a bit harsh, but it depends on the kind of framework you had.
I am writing a little program that creates an index of all files on my directories. It basically iterates over each file on the disk and stores it into a searchable database, much like Unix's locate. The problem is, that index generation is quite slow since I have about a million files.
Once I have generated an index, is there a quick way to find out which files have been added or removed on the disk since the last run?
EDIT: I do not want to monitor the file system events. I think the risk is too high to get out of sync, I would much prefer to have something like a quick re-scan that quickly finds where files have been added / removed. Maybe with directory last modified date or something?
A Little Benchmark
I just made a little benchmark. Running
dir /b /s M:\tests\ >c:\out.txt
takes 0.9 seconds and gives me all the information I need. When I use a Java implementation (much like this), it takes about 4.5 seconds. Any ideas how to improve at least this brute-force approach?
Related posts: How to see if a subfile of a directory has changed
Can you jump out of Java?
You could simply use
dir /b /s /on M:\tests\
the /on sorts by name
if you pipe that out to out.txt
Then do a diff against the last time you ran this, either in Java or in a batch file; something like this in DOS. You'd need to get a diff tool, either diff in Cygwin or the excellent GnuWin32 diffutils: http://gnuwin32.sourceforge.net/packages/diffutils.htm
dir /b /s /on m:\tests >new.txt
diff new.txt archive.txt >diffoutput.txt
del archive.txt
ren new.txt archive.txt
Obviously you could use a Java diff class as well, but I think the thing to accept is that a shell command is nearly always going to beat Java at a file-list operation.
Unfortunately there's no standard way to listen to file system events in Java. This may be coming in Java 7.
For now, you'll have to google "java filesystem events" and pick the custom implementation that matches your platform.
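For reference, Java 7 did eventually ship this as the WatchService API in java.nio.file. A minimal sketch of what that answer is pointing at:

import java.nio.file.*;
import static java.nio.file.StandardWatchEventKinds.*;

// Minimal sketch of the Java 7 WatchService API:
// watch one directory for created/deleted entries.
public class WatchDir {
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get(args[0]);
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher, ENTRY_CREATE, ENTRY_DELETE);
        while (true) {
            WatchKey key = watcher.take(); // blocks until events are available
            for (WatchEvent<?> event : key.pollEvents()) {
                System.out.println(event.kind() + ": " + event.context());
            }
            if (!key.reset()) {
                break; // directory no longer accessible
            }
        }
    }
}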
I've done this in my tool MetaMake. Here is the recipe:
If the index is empty, add the root directory to the index with a timestamp == dir.lastModified()-1.
Find all directories in the index
Compare the timestamp of the directory in the index with the one from the filesystem. This is a fast operation since you have the full path (no scanning of all files/dirs in the tree involved).
If the timestamp has changed, you have a change in this directory. Rescan it and update the index.
If you encounter missing directories in this step, delete the subtree from the index
If you encounter an existing directory, ignore it (will be checked in step 2)
If you encounter a new directory, add it with timestamp == dir.lastModified()-1. Make sure it gets considered in step 2.
This will allow you to notice new and deleted files in an effective manner. Since you scan only for known paths in step #2, this will be very effective. File systems are bad at enumerating all the entries in a directory but they are fast when you know the exact name.
Drawback: You will not notice changed files. So if you edit a file, this will not reflect in a change of the directory. If you need this information, too, you will have to repeat the algorithm above for the file nodes in your index. This time, you can ignore new/deleted files because they have already been updated during the run over the directories.
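A rough sketch of that recipe in Java (a hypothetical class, not MetaMake's actual code; directory-level only, per the drawback above):

import java.io.File;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Rough sketch of the recipe above: keep a map of directory -> last seen
// timestamp, and only rescan directories whose timestamp has changed.
public class DirIndex {
    private final Map<File, Long> dirTimestamps = new HashMap<>();

    public void update(File root) {
        if (dirTimestamps.isEmpty()) {
            dirTimestamps.put(root, root.lastModified() - 1);
        }
        // Iterate over a copy of the known directories (step 2 of the recipe).
        for (File dir : new HashMap<>(dirTimestamps).keySet()) {
            Long known = dirTimestamps.get(dir);
            if (known == null) continue;       // already removed with a parent subtree
            if (!dir.isDirectory()) {
                removeSubtree(dir);            // directory vanished
            } else if (dir.lastModified() != known) {
                rescan(dir);                   // something changed in this directory
                dirTimestamps.put(dir, dir.lastModified());
            }
        }
    }

    private void rescan(File dir) {
        File[] entries = dir.listFiles();
        if (entries == null) return;
        for (File entry : entries) {
            if (entry.isDirectory() && !dirTimestamps.containsKey(entry)) {
                // New directory: add it with an "old" timestamp so it gets
                // rescanned on the next pass.
                dirTimestamps.put(entry, entry.lastModified() - 1);
            }
            // ...update the file index for files in this directory here...
        }
    }

    private void removeSubtree(File dir) {
        for (Iterator<File> it = dirTimestamps.keySet().iterator(); it.hasNext(); ) {
            if (it.next().getPath().startsWith(dir.getPath())) {
                it.remove();
            }
        }
    }
}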
[EDIT] Zach mentioned that timestamps are not enough. My reply is: There simply is no other way to do this. The notion of "size" is completely undefined for directories and changes from implementation to implementation. There is no API where you can register "I want to be notified of any change being made to something in the file system". There are APIs which work while your application is alive but if it stops or misses an event, then you're out of sync.
If the file system is remote, things get worse because all kinds of network problems can cause you to get out of sync. So while my solution might not be 100% perfect and water tight, it will work for all but the most constructed exceptional case. And it's the only solution which even gets this far.
Now there is a single kind of application which would want to preserve the timestamp of a directory after making a modification: a virus or worm. That would clearly break my algorithm, but then, it's not meant to protect against a virus infection. If you want to protect against this, you need a completely different approach.
The only other way to achieve what Zach wants is to build a new filesystem which logs this information permanently somewhere, sell it to Microsoft and wait a few years (probably 10 or more) until everyone uses it.
One way you could speed things up is to just iterate over the directories and check the last-modified time to see if the contents of a directory have changed since your last index; if they have, do a normal scan of that directory and see where things changed. I don't know how portable this will be, though, but changes to the hierarchy propagate up on a Linux system (this might be filesystem dependent), so you can start at the root and work your way down, stopping when you hit a directory that hasn't changed.
Given that we do not want to monitor file system events, could we then just keep track of the (name,size,time,checksum) of each file? The computation of the file checksum (or cryptographic hash, if you prefer) is going to be the bottleneck. You could just compute it once in the initial run, and re-compute it only when necessary subsequently (e.g. when files match on the other three attributes). Of course, we don't need to bother with this if we only want to track filenames and not file content.
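A minimal sketch of computing such a checksum in Java (MD5 purely as an example; any MessageDigest algorithm works, and this is for change detection, not security):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Minimal sketch: compute an MD5 checksum of a file's contents.
// For very large files, prefer streaming via DigestInputStream instead
// of reading the whole file into memory.
public class FileChecksum {
    public static String md5(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(Files.readAllBytes(file));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(md5(Paths.get(args[0])));
    }
}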
You mention that your Java implementation (similar to this) is very slow compared to "dir /s". I think there are two reasons for this:
File.listFiles() is inherently slow. See this earlier question "Is there a workaround for Java’s poor performance on walking huge directories?", and this Java RFE "File.list(FilenameFilter) is not effective for huge directories" for more information. This shortcoming is apparently addressed by NIO.2, coming soon.
Are you traversing your directories using recursion? If so, try a non-recursive approach, like pushing/popping directories to be visited on/off a stack. My limited personal experience suggests that the improvement can be quite significant.
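For example, a non-recursive walk with an explicit stack might look like this (a sketch using java.io.File, to match the era of the question):

import java.io.File;
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of a non-recursive directory walk: directories to visit are pushed
// onto an explicit stack instead of using recursion.
public class IterativeWalk {
    public static void walk(File root) {
        Deque<File> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            File dir = stack.pop();
            File[] entries = dir.listFiles();
            if (entries == null) continue; // unreadable or not a directory
            for (File entry : entries) {
                if (entry.isDirectory()) {
                    stack.push(entry);
                } else {
                    // index the file here, e.g. record name/size/lastModified
                }
            }
        }
    }
}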
The file date approach might not be the best; for example, it can mislead you if you restore a file from backup. Perhaps during indexing you could store an MD5 hash of the file contents. However, you might need to do some benchmarking to see if the performance is acceptable.
I have heard that this task is very hard to do efficiently. I'm sure MS would have implemented a similar tool in Windows if it were easy, especially nowadays since hard drives keep growing and growing.
How about something like this:
private static String execute( String command ) throws IOException {
    Process p = Runtime.getRuntime().exec( "cmd /c " + command );
    InputStream i = p.getInputStream();
    StringBuilder sb = new StringBuilder();
    for( int c = 0 ; ( c = i.read() ) > -1 ; ) {
        sb.append( ( char ) c );
    }
    i.close();
    return sb.toString();
}
(There is a lot of room for improvement there, since that version reads one char at a time; you can pick a better version from here to read the stream faster.)
And you use as argument:
"dir /b /s M:\tests\"
If this is going to be used in a running app (rather than being a standalone app), you can discount the "warm up" time of the JVM, which is about 1-2 seconds depending on your hardware.
You could give it a try to see what's the impact.
Try using git. Version control software is geared towards this kind of problem, and git has a good reputation for speed; it's specifically designed for working fast with local files. 'git diff --name-status' would get you what you want I think.
I haven't checked the implementation or the performance, but commons-io has a listFiles() method. It might be worth a try.