I'm writing a backup program because Windows refuses to let me use its backup program for some reason. (I get an error code that I can report if need be.)
I only want to copy a source node file that is NEWER than a destination node file with the same name.
I found that, even though the last modified dates shown in Windows Properties for two files appeared to be identical, the source was almost invariably being copied--even though it's NOT newer.
Here are declarations:
File from_file = new File(from_name);
File to_file = new File(to_name);
Here's what I finally found for two files in different folders with the same name.
The last 3 digits returned by .lastModified() may be NONzero for one file and ZERO for the other even though the dates shown in the Properties windows for the files appear identical.
My question is WHY would that be the case???
After much frustration and debugging, I have a workaround:
destinationIsOlder = ((from_file.lastModified() / 1000) * 1000
                    > (to_file.lastModified() / 1000) * 1000);
But WHY do I have to do that? I.e., what is Windows doing? Should it do this? Is it a bug?
And what other similar evil awaits me?
I.e., should I divide by a larger integer than 1000?
(It's not the end of the world to copy a file that's technically and incorrectly reported to be a few milliseconds newer, but it's a lot of wear and tear on the drive if it happens for every single file in the source folder and subfolders!)
(I may have just stumbled onto why xcopy didn't do what I wanted, either.)
EDIT The times returned by the two calls shown above were
1419714384951 from from_file.lastModified() and
1419714384000 from to_file.lastModified(). So although the two files appear identical, including the displayed date and time, from_file is reported as newer and is therefore, by my rule, copied--inappropriately.
lastModified returns a long with millisecond precision - therefore the last 3 digits represent the fraction of a second.
Since the file properties dialog only displays time up to the second, the two files will show the same value.
Why are some zero and some non-zero? Lots of reasons. If the file is copied from somewhere else with only second precision, it will be zero. If an application explicitly changes the file modification time, it might only do it with second resolution. And so on.
In the end, I don't think it should affect your backup scheme that much for you to worry about it.
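If it helps, here is a minimal sketch of the same comparison done at whole-second granularity using java.nio.file (Java 7+), so sub-second noise in the stored timestamps can't make an otherwise identical file look newer; the class and method names are just for illustration:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.TimeUnit;

public class NewerCheck {
    // Compare modification times truncated to whole seconds, so sub-second noise
    // in the stored timestamps cannot make an otherwise identical file look newer.
    static boolean destinationIsOlder(Path from, Path to) throws IOException {
        long fromSeconds = Files.getLastModifiedTime(from).to(TimeUnit.SECONDS);
        long toSeconds = Files.getLastModifiedTime(to).to(TimeUnit.SECONDS);
        return fromSeconds > toSeconds;
    }
}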
So I've got these huge text files that are filled with a single comma delimited record per line. I need a way to process the files line by line, removing lines that meet certain criteria. Some of the removals are easy, such as one of the fields is less than a certain length. The hardest criteria is that these lines all have timestamps. Many records are identical except for their timestamps and I have to remove all records but one that are identical and within 15 seconds of one another.
So I'm wondering if some others can come up with the best approach for this. I did come up with a small program in Java that accomplishes the task, using JodaTime for the timestamp stuff which makes it really easy. However, the initial way I coded the program was running into OutofMemory Heap Space errors. I refactored the code a bit and it seemed ok for the most part but I do still believe it has some memory issues as once in awhile the program just seems to get hung up. That and it just seems to take way too long. I'm not sure if this is a memory leak issue, a poor coding issue, or something else entirely. And yes I tried increasing the Heap Size significantly but still was having issues.
I will say that the program needs to be in either Perl or Java. I might be able to make a python script work too but I'm not overly familiar with python. As I said, the timestamp stuff is easiest (to me) in Java because of the JodaTime library. I'm not sure how I'd accomplish the timestamp stuff in Perl. But I'm up for learning and using whatever would work best.
I will also add the files being read in vary tremendously in size but some big ones are around 100Mb with something like 1.3 million records.
My code essentially reads in all the records and puts them into a Hashmap with the keys being a specific subset of the data from a record that similar records would share. So a subset of the record not including the timestamps which would be different. This way you'd end up with some number of records with identical data but that occurred at different times. (So completely identical minus the timestamps).
The value of each key then, is a Set of all records that have the same subset of data. Then I simply iterate through the Hashmap, taking each set and iterating through it. I take the first record and compare its times to all the rest to see if they're within 15 seconds. If so the record is removed. Once that set is finished it's written out to a file until all the records have been gone through. Hopefully that makes sense.
This works but clearly the way I'm doing it is too memory intensive. Anyone have any ideas on a better way to do it? Or, a way I can do this in Perl would actually be good because trying to insert the Java program into the current implementation has caused a number of other headaches. Though perhaps that's just because of my memory issues and poor coding.
Finally, I'm not asking someone to write the program for me. Pseudo code is fine. Though if you have ideas for Perl I could use more specifics. The main thing I'm not sure how to do in Perl is the time comparison stuff. I've looked a little into Perl libraries but haven't seen anything like JodaTime (though I haven't looked much). Any thoughts or suggestions are appreciated. Thank you.
Reading all the rows in is not ideal, because you need to store the whole lot in memory.
Instead you could read line by line, writing out the records that you want to keep as you go. You could keep a cache of the rows you've hit previously, bounded to be within 15 seconds of the current line. In very rough pseudo-code, for every line you'd read:
var line = ReadLine()
DiscardAnythingInCacheOlderThan(line.Date().Minus(15 seconds));
if (!cache.ContainsSomethingMatchingCriteria()) {
    // it's a line we want to keep
    WriteLine(line);
}
UpdateCache(line); // make sure we store this line so we don't write it out again.
As pointed out, this assumes that the lines are in time stamp order. If they aren't, then I'd just use UNIX sort to make it so they are, as that'll quite merrily handle extremely large files.
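A rough Java sketch of that line-by-line approach might look like the following; keyOf() and timeOf() are hypothetical placeholders for however you extract the non-timestamp fields and parse the timestamp of a record, and the input is assumed to already be sorted by timestamp:
import java.io.*;
import java.util.*;

public class DedupSketch {
    // Hypothetical helper: the record minus its timestamp field, so records that
    // differ only by timestamp map to the same key.
    static String keyOf(String line) { return line; /* placeholder */ }

    // Hypothetical helper: the record's timestamp parsed to epoch milliseconds.
    static long timeOf(String line) { return 0L; /* placeholder */ }

    public static void main(String[] args) throws IOException {
        // Assumes the input is already sorted by timestamp (e.g. via UNIX sort).
        Map<String, Long> cache = new LinkedHashMap<>(); // key -> time last seen
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
             PrintWriter out = new PrintWriter(new FileWriter(args[1]))) {
            String line;
            while ((line = in.readLine()) != null) {
                long t = timeOf(line);
                // Forget anything more than 15 seconds older than the current record.
                cache.values().removeIf(old -> old < t - 15000);
                String key = keyOf(line);
                if (!cache.containsKey(key)) {
                    out.println(line);   // first record of its kind in the window: keep it
                }
                cache.put(key, t);       // remember it so near-duplicates get dropped
            }
        }
    }
}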
You might read the file and output just the line numbers to be deleted (to be sorted and used in a separate pass.) Your hash map could then contain just the minimum data needed plus the line number. This could save a lot of memory if the data needed is small compared to the line size.
I want to call a few costly update methods whenever my code changes. I hit ctrl-s in Eclipse, this triggers a file save and a hot code replacement, my program checks to see that the file was saved, spends about 5 seconds crunching numbers, and then updates the screen.
I'm using this thing, which I call a few times per second:
public static long lastSourceUpdate = 0;

private static boolean wasUpdated() {
    File source = new File("/home/user/workspace/package/AClass.java");
    long t = source.lastModified();
    if (t > lastSourceUpdate + 2000) { // wait for hcr
        lastSourceUpdate = t;
        return true;
    }
    return false;
}
There are problems with this approach:
Checking the file is unreliable, since compilation and hot code replace can finish a few seconds after the file changes. That's why there's a 2000ms delay above. Though the method returns true, the code I just altered isn't updated - or worse yet, Eclipse updates it halfway through the number-crunching, and the result is hopelessly scrambled.
Checking files is a hack in any case, it should check classes. The disk probably doesn't need to get involved.
It only checks one class, but I sometimes want to check a whole package, or failing that, any changes to the project at all. When a file changes, the package directory's lastModified is not changed. A recursive scan of the folders/packages would work, but isn't very elegant if the package is huge.
Looks ugly.
So, what is the best way to check for when code changes? Perhaps reflection? A serialVersionUID check? It's not like classes themselves have a compilationDate field - or do they? Is there some secret value that Eclipse updates? Is there a file that Eclipse changes with every save?
Thanks for checking this out.
Instead of comparing last modified dates, try comparing MD5 hashes of the file.
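For example, a minimal sketch using the JDK's MessageDigest might look like this (the file path is the one from the question; when and how often you recompute is up to you):
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.util.Arrays;

public class Md5Check {
    // Returns the MD5 digest of a file's contents.
    static byte[] md5Of(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), md)) {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) { /* digest is updated as we read */ }
        }
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        Path source = Paths.get("/home/user/workspace/package/AClass.java"); // path from the question
        byte[] previous = md5Of(source);
        // later, e.g. on a timer:
        boolean changed = !Arrays.equals(previous, md5Of(source));
        System.out.println("changed = " + changed);
    }
}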
I am running a program that I've written in Java in Eclipse. The program has a very deep level of recursion for very large inputs. For smaller inputs the program runs fine however when large inputs are given, I get the following error:
Exception in thread "main" java.lang.StackOverflowError
Can this be solved by increasing the Java stack size and if so, how do I do this in Eclipse?
Update:
@Jon Skeet
The code is traversing a parse tree recursively in order to build up a data structure. So, for example, the code will do some work using a node in the parse tree and call itself on the node's two children, combining their results to give the overall result for the tree.
The total depth of the recursion depends on the size of the parse tree but the code seems to fail (without a larger stack) when the number of recursive calls gets into the 1000s.
Also I'm pretty sure the code isn't failing because of a bug as it works for small inputs.
Open the Run Configuration for your application (Run / Run Configurations..., then look for your application's entry under 'Java Application').
The Arguments tab has a text box "VM arguments"; enter -Xss1m there (or a bigger value for the maximum stack size). The default value is 512 kByte (Sun JDK 1.5 - I don't know if it varies between vendors and versions).
It may be curable by increasing the stack size - but a better solution would be to work out how to avoid recursing so much. A recursive solution can always be converted to an iterative solution - which will make your code scale to larger inputs much more cleanly. Otherwise you'll really be guessing at how much stack to provide, which may not even be obvious from the input.
Are you absolutely sure it's failing due to the size of the input rather than a bug in the code, by the way? Just how deep is this recursion?
EDIT: Okay, having seen the update, I would personally try to rewrite it to avoid using recursion. Generally, having a Stack<T> of "things still to do" is a good starting point to remove recursion.
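As a rough illustration of that suggestion (not your actual code), here is a depth-first walk driven by an explicit Deque instead of the call stack; Node and doWork() are hypothetical stand-ins for the parse-tree node type and the per-node processing, and combining child results would need a little extra bookkeeping (for example a second stack):
import java.util.ArrayDeque;
import java.util.Deque;

public class IterativeTraversal {
    // Hypothetical node type standing in for the parse-tree nodes in the question.
    static class Node {
        Node left, right;
        void doWork() { /* per-node processing */ }
    }

    // Depth-first traversal with an explicit stack instead of recursion,
    // so tree depth is no longer limited by the JVM call stack.
    static void traverse(Node root) {
        Deque<Node> todo = new ArrayDeque<>();   // "things still to do"
        if (root != null) todo.push(root);
        while (!todo.isEmpty()) {
            Node n = todo.pop();
            n.doWork();
            if (n.right != null) todo.push(n.right);
            if (n.left != null) todo.push(n.left);
        }
    }
}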
Add the flag -Xss1024k in the VM Arguments.
You can also specify the stack size in megabytes, for example -Xss1m.
I also had the same problem while parsing schema definition files (XSD) using the XSOM library.
I was able to increase the stack size up to 208 MB, at which point it showed a heap out-of-memory error, and I was only able to increase the heap up to 320 MB.
The final configuration was -Xmx320m -Xss208m, but then again it ran for some time and failed.
My function prints the entire tree of the schema definition recursively; amazingly, the output file exceeded 820 MB for a 4 MB definition file (the AIXM library), which in turn uses a 50 MB schema definition library (ISO GML).
With that I am convinced I have to avoid recursion and switch to iteration and some other way of representing the output, but I am having a little trouble converting all that recursion to iteration.
When the argument -Xss doesn't do the job try deleting the temporary files from:
c:\Users\{user}\AppData\Local\Temp\.
This did the trick for me.
You need to have a launch configuration inside Eclipse in order to adjust the JVM parameters.
After running your program with either F11 or Ctrl-F11, open the launch configurations in Run -> Run Configurations... and open your program under "Java Applications". Select the Arguments pane, where you will find "VM arguments".
This is where -Xss1024k goes.
If you want the launch configuration to be a file in your workspace (so you can right click and run it), select the Common pane, and check the Save as -> Shared File checkbox and browse to the location you want the launch file. I usually have them in a separate folder, as we check them into CVS.
Look at Morris in-order tree traversal which uses constant space and runs in O(n) (up to 3 times longer than your normal recursive traversal - but you save hugely on space). If the nodes are modifiable, than you could save the calculated result of the sub-tree as you backtrack to its root (by writing directly to the Node).
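For reference, here is a sketch of Morris in-order traversal on a minimal binary Node type; visit() stands in for whatever per-node work you do. It temporarily threads each in-order predecessor to its successor and removes the thread on the way back up, so it needs no stack and no recursion:
public class MorrisTraversal {
    // Minimal binary-tree node; stands in for whatever node type the tree uses.
    static class Node {
        Node left, right;
        int value;
    }

    static void visit(Node n) { System.out.println(n.value); } // per-node work

    // Morris in-order traversal: O(n) time, O(1) extra space, no recursion.
    static void morrisInorder(Node root) {
        Node cur = root;
        while (cur != null) {
            if (cur.left == null) {
                visit(cur);
                cur = cur.right;
            } else {
                Node pred = cur.left;                  // find in-order predecessor
                while (pred.right != null && pred.right != cur) {
                    pred = pred.right;
                }
                if (pred.right == null) {
                    pred.right = cur;                  // thread predecessor back to cur
                    cur = cur.left;
                } else {
                    pred.right = null;                 // remove thread, tree restored
                    visit(cur);
                    cur = cur.right;
                }
            }
        }
    }
}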
When using JBoss Server, double-click on the server:
Go to "Open Launch Configuration"
Then change the min and max memory sizes (like 1G, 1m).
I have a set of source folders. I use a Java class to build the distribution file out of these folders. I'd like to write another little class in Java which runs every half a second, checks if any of the files in the folders have changed, and if yes, run the building class.
So, how do I easily detect that a folder has been modified?
I think you will need to check the directory and subdirectory modification times (for files being added/removed) and the file modification times (for the changes in each file).
Write a recursive routine that checks a directory's modification time (and each file's) to see if it has changed, then checks the directory contents and calls itself recursively for any subdirectories. You should just be able to check for any modification times greater than when you last ran the check.
See File.lastModified()
EDIT: Since I wrote the above, Java 7 came out with its directory watching capability.
Here is a list of possible solutions and an example of a simple File/Folder Watcher.
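For example, a minimal sketch of the Java 7 WatchService (the directory path is hypothetical, and note that register() watches a single directory, not its whole subtree):
import java.nio.file.*;

public class DirWatch {
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get("/path/to/source/folder");   // hypothetical path
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_DELETE,
                StandardWatchEventKinds.ENTRY_MODIFY);
        while (true) {
            WatchKey key = watcher.take();                 // blocks until something changes
            for (WatchEvent<?> event : key.pollEvents()) {
                System.out.println(event.kind() + ": " + event.context());
            }
            if (!key.reset()) {
                break;                                     // directory no longer accessible
            }
        }
    }
}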
If you are allowed to use Java 7, it has support for platform-independent directory/file change notifications.
JNA has a sample for cross platform change notification here. Not sure how easy you might find it.
I don't know if it's any good, but here's one person's take on the problem.
Sounds like .NET has something built-in: FileSystemWatcher
UPDATE: Thanks to kd304, I just learned that Java 7 will have the same feature. Won't do you much good today, unless you can use the preview release.
You need to watch each file and keep track of the File.lastModified attribute and check the File.exists flag together with a bit of simple recursion to walk the directory structure.
With NIO.2 (Java 7) it will be very easy. With Java 6 you could call list() and compare with the previous list once a second (a poor man's watch service).
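A tiny sketch of that polling idea (the path is hypothetical; it only notices names appearing or disappearing, not edits to file contents):
import java.io.File;
import java.util.Arrays;

public class PollingWatch {
    public static void main(String[] args) throws InterruptedException {
        File dir = new File("/path/to/source/folder");   // hypothetical directory
        String[] previous = dir.list();                   // assumes the directory exists
        Arrays.sort(previous);
        while (true) {
            Thread.sleep(1000);                           // poll once a second
            String[] current = dir.list();
            Arrays.sort(current);
            if (!Arrays.equals(previous, current)) {
                System.out.println("directory contents changed");
                previous = current;
            }
        }
    }
}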
Yes, there are a number of available listeners for directories, but they're all relatively complicated and most involve threads.
A few days ago I ended up in an almost heated discussion with one of our engineers over whether it was permissible to create a new thread (in a web application) simply to monitor a directory tree. In the end I agreed with him, but only by virtue of coming up with something so fast that having a listener is unnecessary. Note: the solution described below only works if you don't need to know which file has changed, only that a file has changed.
You provide the following method with a Collection of Files (e.g., obtained via Apache IO's FileUtils.listFiles() method) and this returns a hash for the collection. If any file is added, deleted, or its modification date changed, the hash will change.
In my tests, 50K files takes about 750ms on a 3Ghz Linux box. Touching any of the files alters the hash. In my own implementation I'm using a different hash algorithm (DJB) that's a bit faster but that's the gist of it. We now just store the hash and check each time as it's pretty painless, especially for smaller file collections. If anything changes we then re-index the directory. The complexity of a watcher just wasn't worth it in our application.
/**
 * Provided a directory and a file extension, returns
 * a hash using the Adler hash algorithm.
 *
 * @param files the Collection of Files to hash.
 * @return a hash of the Collection.
 */
public static long getHash( Collection<File> files )
{
    Adler32 adler = new Adler32();
    StringBuilder sb = new StringBuilder();
    for ( File f : files ) {
        String s = f.getParent() + '/' + f.getName() + ':' + String.valueOf(f.lastModified());
        adler.reset();
        adler.update(s.getBytes());
        sb.append(adler.getValue()).append(' ');
    }
    adler.reset();
    adler.update(sb.toString().getBytes());
    return adler.getValue();
}
And yes, there's room for improvement (e.g., we use a hash method rather than inlining it). The above is cut down from our actual code but should give you a good idea what we did.
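As a usage sketch, the calling side might look roughly like this if you drop it into the same class as getHash() above; the directory, extension, and polling policy are placeholders, and FileUtils.listFiles() comes from Apache Commons IO:
import java.io.File;
import java.util.Collection;
import org.apache.commons.io.FileUtils;   // Apache Commons IO

// in the same class as getHash(Collection<File>) above:
private static long lastHash;

static boolean somethingChanged(File dir) {
    // listFiles(dir, extensions, recursive): all *.java files under dir, recursively
    Collection<File> files = FileUtils.listFiles(dir, new String[] { "java" }, true);
    long hash = getHash(files);
    boolean changed = (hash != lastHash);
    lastHash = hash;
    return changed;
}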
I am writing a little program that creates an index of all files on my directories. It basically iterates over each file on the disk and stores it into a searchable database, much like Unix's locate. The problem is, that index generation is quite slow since I have about a million files.
Once I have generated an index, is there a quick way to find out which files have been added or removed on the disk since the last run?
EDIT: I do not want to monitor the file system events. I think the risk is too high to get out of sync, I would much prefer to have something like a quick re-scan that quickly finds where files have been added / removed. Maybe with directory last modified date or something?
A Little Benchmark
I just made a little benchmark. Running
dir /b /s M:\tests\ >c:\out.txt
Takes 0.9 seconds and gives me all the information I need. When I use a Java implementation (much like this), it takes about 4.5 seconds. Any ideas how to improve at least this brute force approach?
Related posts: How to see if a subfile of a directory has changed
Can you jump out of Java?
You could simply use
dir /b /s /on M:\tests\
The /on sorts by name.
If you pipe that out to out.txt, you can then diff it against the output from the last time you ran the command, either in Java or in a batch file. Something like this in DOS (you'd need to get a diff tool, either diff in Cygwin or the excellent http://gnuwin32.sourceforge.net/packages/diffutils.htm):
dir /b /s /on m:\tests >new.txt
diff new.txt archive.txt >diffoutput.txt
del archive.txt
ren new.txt archive.txt
Obviously you could use a java diff class as well but I think the thing to accept is that a shell command is nearly always going to beat Java at a file list operation.
Unfortunately there's no standard way to listen to file system events in java. This may be coming in java7.
For now, you'll have to google "java filesystem events" and pick the custom implementation that matches your platform.
I've done this in my tool MetaMake. Here is the recipe:
1. If the index is empty, add the root directory to the index with a timestamp == dir.lastModified()-1.
2. Find all directories in the index.
3. Compare the timestamp of the directory in the index with the one from the filesystem. This is a fast operation since you have the full path (no scanning of all files/dirs in the tree involved).
4. If the timestamp has changed, you have a change in this directory. Rescan it and update the index.
   - If you encounter missing directories in this step, delete the subtree from the index.
   - If you encounter an existing directory, ignore it (it will be checked in step 2).
   - If you encounter a new directory, add it with timestamp == dir.lastModified()-1. Make sure it gets considered in step 2.
This will allow you to notice new and deleted files in an effective manner. Since you scan only known paths in step 2, this will be very efficient. File systems are bad at enumerating all the entries in a directory, but they are fast when you know the exact name.
Drawback: You will not notice changed files. So if you edit a file, this will not reflect in a change of the directory. If you need this information, too, you will have to repeat the algorithm above for the file nodes in your index. This time, you can ignore new/deleted files because they have already been updated during the run over the directories.
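A very rough Java sketch of the recipe above (this is not MetaMake's actual code; the index here is just a map from directory path to the timestamp recorded at the last scan, and rescan() is where you would diff the files of a changed directory against your index):
import java.io.File;
import java.util.HashMap;
import java.util.Map;

public class DirectoryIndex {
    // directory path -> lastModified() value recorded at the previous scan
    private final Map<String, Long> dirTimes = new HashMap<>();

    void update(File root) {
        if (dirTimes.isEmpty()) {
            dirTimes.put(root.getPath(), root.lastModified() - 1);      // step 1: force a first scan
        }
        // steps 2-4: look only at directories we already know about
        for (Map.Entry<String, Long> entry : new HashMap<>(dirTimes).entrySet()) {
            File dir = new File(entry.getKey());
            if (!dir.exists()) {
                removeSubtree(entry.getKey());                           // directory is gone
            } else if (dir.lastModified() != entry.getValue()) {
                rescan(dir);                                             // something changed in here
                dirTimes.put(entry.getKey(), dir.lastModified());
            }
        }
    }

    private void rescan(File dir) {
        File[] children = dir.listFiles();
        if (children == null) {
            return;
        }
        for (File child : children) {
            if (child.isDirectory() && !dirTimes.containsKey(child.getPath())) {
                dirTimes.put(child.getPath(), child.lastModified() - 1); // new dir: checked next round
            }
            // ... also diff the files in this directory against the index here
        }
    }

    private void removeSubtree(String path) {
        dirTimes.keySet().removeIf(p -> p.equals(path) || p.startsWith(path + File.separator));
    }
}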
[EDIT] Zach mentioned that timestamps are not enough. My reply is: There simply is no other way to do this. The notion of "size" is completely undefined for directories and changes from implementation to implementation. There is no API where you can register "I want to be notified of any change being made to something in the file system". There are APIs which work while your application is alive but if it stops or misses an event, then you're out of sync.
If the file system is remote, things get worse because all kinds of network problems can cause you to get out of sync. So while my solution might not be 100% perfect and water tight, it will work for all but the most constructed exceptional case. And it's the only solution which even gets this far.
Now there is a single kind of application which would want to preserve the timestamp of a directory after making a modification: a virus or worm. This will clearly break my algorithm, but then, it's not meant to protect against a virus infection. If you want to protect against this, you must take a completely different approach.
The only other way to achieve what Zach wants is to build a new filesystem which logs this information permanently somewhere, sell it to Microsoft and wait a few years (probably 10 or more) until everyone uses it.
One way you could speed things up is to just iterate over the directories and check the last-modified time to see if the contents of a directory have changed since your last index; if they have, do a normal scan of that directory and see where things changed. I don't know how portable this will be, though: on a Linux system a change in the hierarchy propagates up (this might be filesystem-dependent), so you can start at the root and work your way down, stopping when you hit a directory that hasn't changed.
Given that we do not want to monitor file system events, could we then just keep track of the (name,size,time,checksum) of each file? The computation of the file checksum (or cryptographic hash, if you prefer) is going to be the bottleneck. You could just compute it once in the initial run, and re-compute it only when necessary subsequently (e.g. when files match on the other three attributes). Of course, we don't need to bother with this if we only want to track filenames and not file content.
You mention that your Java implementation (similar to this) is very slow compared to "dir /s". I think there are two reasons for this:
File.listFiles() is inherently slow. See this earlier question "Is there a workaround for Java’s poor performance on walking huge directories?", and this Java RFE "File.list(FilenameFilter) is not effective for huge directories" for more information. This shortcoming is apparently addressed by NIO.2, coming soon.
Are you traversing your directories using recursion? If so, try a non-recursive approach, like pushing/popping directories to be visited on/off a stack. My limited personal experience suggests that the improvement can be quite significant.
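As a sketch of that non-recursive approach (just an illustration, not your code):
import java.io.File;
import java.util.ArrayDeque;
import java.util.Deque;

public class NonRecursiveWalk {
    // Walks a directory tree without recursion by keeping the directories
    // still to visit on an explicit stack.
    static void walk(File root) {
        Deque<File> toVisit = new ArrayDeque<>();
        toVisit.push(root);
        while (!toVisit.isEmpty()) {
            File dir = toVisit.pop();
            File[] children = dir.listFiles();
            if (children == null) continue;           // not a directory, or not readable
            for (File child : children) {
                if (child.isDirectory()) {
                    toVisit.push(child);
                } else {
                    // index the file, e.g. record child.getPath() and child.lastModified()
                }
            }
        }
    }
}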
The file date approach might not be the best - for example, if you restore a file from backup. Perhaps during the indexing you could store an MD5 hash of the file contents. However, you might need to do some performance benchmarking to see if the performance is acceptable.
I have heard that this task is very hard to do efficiently. I'm sure MS would have added a similar tool to Windows if it were easy, especially nowadays since hard drives keep growing and growing.
How about something like this:
private static String execute( String command ) throws IOException {
    Process p = Runtime.getRuntime().exec( "cmd /c " + command );
    InputStream i = p.getInputStream();
    StringBuilder sb = new StringBuilder();
    for( int c = 0 ; ( c = i.read() ) > -1 ; ) {
        sb.append( ( char ) c );
    }
    i.close();
    return sb.toString();
}
(There is a lot of room for improvement there, since that version reads one char at a time; you can pick a better version from here to read the stream faster.)
And you pass as the argument:
"dir /b /s M:\tests\"
If this is going to be used in a running app (rather than as a standalone app), you can discount the "warm up" time of the JVM, which is about 1-2 seconds depending on your hardware.
You could give it a try to see what's the impact.
Try using git. Version control software is geared towards this kind of problem, and git has a good reputation for speed; it's specifically designed for working fast with local files. 'git diff --name-status' would get you what you want I think.
I haven't checked the implementation or the performance, but commons-io has a listFiles() method. It might be worth a try.