We have to monitor changes to files on a remote system that we access through FTP or SMB. We do not have any SSH access to the remote system/OS; our only view of the remote system is what FTP or Samba lets us see.
What we do today:
Periodically scan the whole directory, construct a representation in memory to do our work, and then merge it with what we have in the database.
What we would like to do:
Be able to determine whether the directory has changed, and thus whether a parse is needed. Ideally, we would never have to do a full parse. We don't want to rely too much on OS-specific capabilities (inodes...) because they can vary from one installation to another.
Main goal: this process starts to get slow when the amount of data is very large. Only a few percent of this data is new and needs to be parsed. How do we parse and add to our database only that part?
The leads we are discussing at the moment:
checking the size of folders
using checksums on files
checking the last modification date of each folder/file
What we really want:
Some input and best practices, because this problem seems pretty common and must have been discussed before, and we don't want to end up overcomplicating this part.
Thanks in advance, a bunch of fellow developers ;-)
We use a Java/Spring/Hibernate stack, but I don't think that matters much here.
Edit: basically, we access an FTP server or equivalent. A local copy is not an option, since the amount of data is way too large.
The Remote Directory Poller for Java (rdp4j) library can help you poll your FTP location and notify you of the following events: file added/removed/modified in a directory. It uses the lastModified date of each file in the directory and compares it with the previous poll.
See the complete User Guide, which contains implementations of the FtpDirectory and MyListener classes used in the quick tutorial of the API below:
package example;

import java.util.concurrent.TimeUnit;

import com.github.drapostolos.rdp4j.DirectoryPoller;
import com.github.drapostolos.rdp4j.spi.PolledDirectory;

public class FtpExample {

    public static void main(String[] args) throws Exception {
        String host = "ftp.mozilla.org";
        String workingDirectory = "pub/addons";
        String username = "anonymous";
        String password = "anonymous";

        // FtpDirectory is your implementation of the PolledDirectory SPI (see the User Guide)
        PolledDirectory polledDirectory = new FtpDirectory(host, workingDirectory, username, password);

        DirectoryPoller dp = DirectoryPoller.newBuilder()
                .addPolledDirectory(polledDirectory)
                .addListener(new MyListener()) // your callback for added/removed/modified events
                .setPollingInterval(10, TimeUnit.MINUTES)
                .start();

        TimeUnit.HOURS.sleep(2);
        dp.stop();
    }
}
You cannot use directory sizes or modification dates to tell whether subdirectories have changed. Full stop. At a minimum you have to do a full directory listing of the whole tree.
You may be able to avoid reading file contents if you are satisfied that you can rely on the combination of modification date and time.
My suggestion is to use off-the-shelf software to create a local clone (e.g. rsync, robocopy) and then do the comparison/parse on the local clone. The question "has it been updated?" then becomes a question for rsync to answer.
As previously mentioned, there is no way to track directories via FTP or SMB. What you can do is list all files on the remote server and construct a snapshot that contains:
for each file: name, size, and modification date;
for each directory: name and the latest modification date among its contents.
Using this information you will be able to determine which directories need to be looked into and which files need to be transferred.
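A minimal sketch of building such a snapshot over FTP, assuming the Apache Commons Net FTPClient and a hypothetical Entry record (Java 16+); recursion and error handling are left deliberately bare:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.net.ftp.FTPClient;
import org.apache.commons.net.ftp.FTPFile;

public class SnapshotBuilder {

    // Hypothetical holder for one snapshot entry: path, size, last modification time.
    record Entry(String path, long size, long lastModified) {}

    // Recursively list a remote directory and record (name, size, mtime) for each file.
    static void snapshot(FTPClient ftp, String dir, List<Entry> out) throws IOException {
        for (FTPFile f : ftp.listFiles(dir)) {
            String path = dir + "/" + f.getName();
            if (f.isDirectory()) {
                snapshot(ftp, path, out); // a directory's mtime can be derived as the max of its children
            } else {
                out.add(new Entry(path, f.getSize(), f.getTimestamp().getTimeInMillis()));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        FTPClient ftp = new FTPClient();
        ftp.connect("ftp.example.com"); // placeholder host
        ftp.login("anonymous", "anonymous");
        List<Entry> entries = new ArrayList<>();
        snapshot(ftp, "/pub", entries);
        ftp.logout();
        // Compare 'entries' against the previous snapshot to find new/changed files.
    }
}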
The safe and portable solution is to use a strong hash/checksum such as SHA-1 or (preferably) SHA-512. The hash can be mapped to whatever representation you want to compute and store. You can use the following recursive recipe (adapted from the Git version control system):
The hash of a file is the hash of its contents, disregarding the name;
to hash a directory, treat it as a sorted list of filename-hash pairs in a textual representation and hash that.
Maybe prepend f to every file and d to every directory representation before hashing, so that a file and a directory can never produce the same hash.
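A minimal sketch of that recipe, shown against a local java.io.File tree for readability (over FTP you would feed the streamed remote contents into the digest instead); it uses the JDK's MessageDigest with SHA-512, and HexFormat from Java 17 for encoding:

import java.io.File;
import java.nio.file.Files;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.HexFormat;

public class TreeHasher {

    // Hash of a file: "f" prefix plus the raw contents, name excluded.
    static String hashFile(File f) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-512");
        md.update((byte) 'f');
        md.update(Files.readAllBytes(f.toPath()));
        return HexFormat.of().formatHex(md.digest());
    }

    // Hash of a directory: "d" prefix plus the sorted list of "name=hash" lines.
    static String hashDir(File dir) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-512");
        md.update((byte) 'd');
        File[] children = dir.listFiles();
        Arrays.sort(children); // sorting makes the hash independent of listing order
        for (File c : children) {
            String h = c.isDirectory() ? hashDir(c) : hashFile(c);
            md.update((c.getName() + "=" + h + "\n").getBytes("UTF-8"));
        }
        return HexFormat.of().formatHex(md.digest());
    }
}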
You could also put the directory under version control using Git (or Mercurial, or whatever you like), periodically git add everything in it, use git status to find out what was updated, and git commit the changes.
I have a folder with 1.5 million objects (about 5 TB of data) containing folders named in the format 123-John.
I need to copy all of these folders' contents into new folders, renaming them to the format 123.
I want to do it by means of Java.
Obviously I can't just do it one by one like this:
ObjectListing objectListing = s3.listObjects(listObjectsRequest);
boolean processable = true;
while (processable) {
    processable = objectListing.isTruncated();
    renameAndCopyOneByOne(objectListing.getObjectSummaries()); // edits the name and calls s3.copyObject()
    if (processable) {
        objectListing = s3.listNextBatchOfObjects(objectListing);
    }
}
because it would lead to roughly 1.5 million calls to
s3.copyObject(bucket, sourceKey, bucket, destinationKey)
I wanted to do it with S3 Batch Operations, but those can only be driven by a manifest file in CSV format like
bucketName,keyName
That is just a manifest of the objects I want to act on; I can't list destination locations or specify the edited folder name. I would also still have to split a CSV of 1.5 million rows into smaller ones and make several requests to S3 to create several jobs, which would be hard to track.
Could you please give me a hint as to which AWS tool would suit all my needs for this task?
Well, after some time spent figuring out how to do this properly, I think the only way is to do the migration in a batch job written in Java, splitting the load, because AWS does not have a proper tool for my case.
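A minimal sketch of such a job, assuming the AWS SDK for Java v1 (the same AmazonS3 client as the snippet above) and a hypothetical renameKey() rule; copies are fanned out over a thread pool to split the load:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class BulkRenameCopy {

    // Hypothetical rename rule: "123-John/file.txt" -> "123/file.txt".
    static String renameKey(String key) {
        return key.replaceFirst("^(\\d+)-[^/]*/", "$1/");
    }

    public static void main(String[] args) throws InterruptedException {
        String bucket = "my-bucket"; // placeholder
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        ExecutorService pool = Executors.newFixedThreadPool(32);

        ObjectListing listing = s3.listObjects(bucket);
        while (true) {
            for (S3ObjectSummary s : listing.getObjectSummaries()) {
                String src = s.getKey();
                pool.submit(() -> s3.copyObject(bucket, src, bucket, renameKey(src)));
            }
            if (!listing.isTruncated()) break;
            listing = s3.listNextBatchOfObjects(listing);
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS); // wait for all copies to finish
    }
}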
Every example I see says to use a strong password, but then they just slap it in the source code. That doesn't seem quite right to me.
Is it possible to authenticate to the keystore as the current account so that no passwords are involved?
If that's not possible: I have a requirement not to store passwords in source code, which is perfectly acceptable, but it seems like at some point a password needs to be part of the equation. Can someone point me to the most secure way to handle this?
I'm tempted to add the password as part of the build, but then I have a plaintext password on the build server, which seems roughly the same to me.
First, the general rule: if you ship software which, all by itself, is able to unlock some 'secure store' (so, without involvement of a server under your control or some other hardware under your control), it is impossible to hide this information from the owner of the computer it runs on.
Example: web browsers tend to offer to store website passwords. These passwords are stored in files, and with the right tools you can open those files and see the passwords plain as day. There is no way to fix that with more software or more cryptographic algorithms. The only solution is to make the software incapable of unlocking the datastore on its own, for example by requiring that the user enter a master password every time they want to look into it, or by putting some sort of secure enclave into the hardware and having THAT take care of the crypto. Java generally does not have the right libraries to interact with such hardware (Apple's T2 is a very advanced take on this concept; TPM chips are a budget option).
So, once you're okay with that, you can still say: okay, I do NOT want the authentication keys in the source code; in order to build a production distributable from the sources, the builder will have to supply them. To accomplish that (assuming a Maven-style project structure):
Create the file src/main/resources/com/yourcompany/yourproject/keys/KeyFile.txt and then update your .gitignore file to ignore it, by putting /src/main/resources/com/yourcompany/yourproject/keys/KeyFile.txt in there.
Write the key in this file. Share this file by some secure means with all members of the project who should have it.
Write code: in src/main/java/com/yourcompany/yourproject/keys/ProjectKey.java, have a static method to retrieve the key. It would look something like this:
package com.yourcompany.yourproject.keys;

import java.io.IOException;
import java.io.InputStream;
import java.util.Scanner;

public final class ProjectKey {

    private ProjectKey() {} // prevent instantiation

    private static final String PROJECT_KEY = loadKey();

    public static String getKey() {
        if (PROJECT_KEY != null) return PROJECT_KEY;
        throw new IllegalStateException(
            "Key file not present; find somebody with the file and place it in: " +
            "src/main/resources/com/yourcompany/yourproject/keys/KeyFile.txt");
    }

    private static String loadKey() {
        // Resolved relative to this class's package, so it finds the resource created above.
        InputStream in = ProjectKey.class.getResourceAsStream("KeyFile.txt");
        try {
            return in == null ? null : new Scanner(in, "UTF-8").next();
        } finally {
            if (in != null) try { in.close(); } catch (IOException ignored) {}
        }
    }
}
Alternatively, make a dir in your project called 'keys' or whatnot, and in your .gitignore file at the root put the line /keys to ensure its contents never go into source control.
You'd have to mess around with some build-tool plugins if you want builds to fail when the file is missing. Also, you'd have to change the Scanner's delimiter if the 'key' contains any whitespace (I'm using Scanner here as the quickest way to turn an InputStream into a complete string; if you have, for example, Guava, it has better calls for that, and you should use those).
I'm adding code to a large JSP web application, integrating functionality to convert CGM files to PDFs (or PDFs to CGMs) to display to the user.
It looks like I can create the converted files and store them in the directory designated by System.getProperty("java.io.tmpdir"). How do I manage their deletion, though? The program resides on a Linux-based server. Will the OS automatically delete from /tmp, or will I need to implement that myself? If it's the latter, what are good ways to go about it?
EDIT: I see I can use deleteOnExit() (relevant answer elsewhere), but the JVM runs more or less continuously in the background, so I'm not sure the exits would be frequent enough.
I don't think I need to cache any converted files; I can just convert a file anew every time it's needed.
You can do this:

File file = File.createTempFile("base_name", ".tmp", new File(temporaryFolderPath));
file.deleteOnExit();

The file will be deleted when the virtual machine terminates.
Edit:
If you want to delete it as soon as the job is done, just do it:

File file = File.createTempFile("webdav", ".tmp", new File(temporaryFolderPath));
try {
    // do something with the file
} finally {
    file.delete(); // clean up immediately instead of waiting for JVM exit
}

(Creating the file before the try block avoids a NullPointerException in the finally block if createTempFile itself fails.)
There are ways to have the JVM delete files when it exits using deleteOnExit(), but there are known memory leaks with that method. Here is a blog post explaining the leak: http://www.pongasoft.com/blog/yan/java/2011/05/17/file-dot-deleteOnExit-is-evil/
A better solution would be either to delete old files using a cron job, or, if you know you aren't going to use the file again, to simply delete it after processing.
From your comment:
Also, could I just create something that checks to see if the size of my files exceeds a certain amount, and then deletes the oldest ones if that's true? Or am I overthinking it?
You could create a class that keeps track of the created files against a size limit. When the total size of the created files goes over the limit after creating a new one, it deletes the oldest. Beware that this may delete a file that still needs to exist, even if it is the oldest one; you might need a way to know which files must be kept and delete only those that are no longer needed.
You could also have a timer in the class to check periodically instead of after each creation. This solution is tied to your application, while using cron isn't.
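A minimal sketch of such a tracker, with an assumed byte limit and no notion of "still needed" files (you would add that check before deleting):

import java.io.File;
import java.util.ArrayDeque;
import java.util.Deque;

public class TempFileTracker {

    private final Deque<File> created = new ArrayDeque<>(); // oldest first
    private final long maxBytes;
    private long totalBytes;

    public TempFileTracker(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    // Register a newly created temp file, then evict oldest files until under the cap.
    public synchronized void register(File f) {
        created.addLast(f);
        totalBytes += f.length();
        while (totalBytes > maxBytes && created.size() > 1) {
            File oldest = created.removeFirst();
            totalBytes -= oldest.length();
            oldest.delete(); // a real implementation would check "still in use" here
        }
    }
}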
I am currently writing a program which takes user input and creates rows of a comma-delimited .csv file. I need a way to save this data such that users are not able to easily edit it. It does not need to be super secure, just enough that it couldn't accidentally be edited. I also need another file (or the same file?) to be easily accessible in the file system by the user, so that they can email it to a system admin who can then open the .csv file. I could provide this second person with a conversion program if necessary.
The file I save data in and the file to be sent can be two different files if there is any advantage to that. I am currently considering just using a file with a weird file extension but saving it as a text file, so that the user will only be able to open it if they know to try that. The other option is some sort of encryption, but I'm not sure it's necessary, and even if it were, I wouldn't know where to start.
Thanks for the help :)
Edit: This file is meant to store the actual data being entered. Currently the data is gathered on paper forms which are then sent to the admin, who manually enters all of the data. This little app is meant to have someone else enter the data from the paper form and tell them whether they've entered it all correctly. After they've entered it all, they then need to send the data to the admin. It would be preferable if the sending were handled automatically, but this app needs to be very simple and low-budget, and I don't want an internet connection to be a requirement.
You could store your data in a serializable object and save that. It would resist casual editing and be very simple to read and write from your app. This page should get you started: http://java.sun.com/developer/technicalArticles/Programming/serialization/
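A minimal sketch of that idea, storing the CSV rows as a serialized List (class and file names are illustrative):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.List;

public class RowStore {

    // Write the rows as one serialized object; the binary form resists casual editing.
    static void save(List<String[]> rows, String path) throws Exception {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(path))) {
            out.writeObject(new ArrayList<>(rows));
        }
    }

    @SuppressWarnings("unchecked")
    static List<String[]> load(String path) throws Exception {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(path))) {
            return (List<String[]>) in.readObject();
        }
    }
}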
From your question, I am guessing that the uneditable file's purpose is to store some kind of system config and you don't want it to get messed up easily. From your own suggestions, it seems that even knowing that the file has been edited would help you, since you can then avoid using it. If that is the case, then you can use simple checks, such as saving the total number of characters in the line as the first or last comma-delimited value. Then, before you use the file, you run a small validation routine on it to verify that the file is indeed unaltered.
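A minimal sketch of that check, appending the character count of each line as a final field and validating it on read:

public class LineCheck {

    // "a,b,c" -> "a,b,c,5" (5 = length of the original line)
    static String seal(String line) {
        return line + "," + line.length();
    }

    // True if the trailing count still matches the rest of the line.
    static boolean isIntact(String sealed) {
        int comma = sealed.lastIndexOf(',');
        if (comma < 0) return false;
        try {
            int claimed = Integer.parseInt(sealed.substring(comma + 1));
            return claimed == comma; // number of characters before the last comma
        } catch (NumberFormatException e) {
            return false;
        }
    }
}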
Another approach may be to use a ZIP (file) containing a "plain text format" (CSV, XML, or another serialization method) and, optionally, a well-known (to you) password.
This approach could be used with other stream/package types: the idea behind using a ZIP (as opposed to an object serializer directly) is that one can open/inspect/modify the data/file(s) easily without special program support. This may or may not be a benefit, and using a password may or may not even be required; see below.
Some advantages of using a ZIP (or CAB):
The ability to hold multiple resources (aids extensibility)
The ability to save the actual data in a "text format" (XML, perhaps)
Competitive file sizes for "common data"
Re-use of existing tooling support (you also get checksum validation for free!)
Additionally, using a non-ZIP file extension will prevent most users from casually associating the file with the ZIP format and being able to open it (a similar approach to what is presented in the original post, but subtly different, because the ZIP format itself is not "plain text"). A number of modern Microsoft formats rely on the file extension playing an important role and use CAB (and sometimes ZIP) formats as the container format for the document. That is, an ".XSN", ".WSP", or ".gadget" file can be opened with a tool like 7-Zip, but generally only developers "in the know" do so. Also consider ".WAR" and ".JAR" files as other examples of this approach, since this is Java we're in.
Traditional ZIP passwords are not secure, and even less so when a static password is embedded in the program. However, if this is just a deterrent (i.e. not for "security"), those issues are not important. Coupled with an "un-associated" file type/extension, I believe this offers the protection asked for in the question while remaining flexible. It may be possible to drop the password entirely and still prevent "accidental modifications" just by using a ZIP (or other) container format, depending on requirements/desires.
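A minimal sketch of the container idea using the JDK's java.util.zip and a made-up ".dat" extension (note the standard library cannot password-protect ZIPs; you would need a third-party library such as zip4j for that):

import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class CsvContainer {

    // Pack the CSV text into a ZIP saved under an extension users won't associate with ZIP.
    static void save(String csvText, String path /* e.g. "report.dat" */) throws Exception {
        try (ZipOutputStream zip = new ZipOutputStream(new FileOutputStream(path))) {
            zip.putNextEntry(new ZipEntry("data.csv"));
            zip.write(csvText.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
    }
}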
Happy coding.
Can you set file permissions to make it read-only?
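For instance, a one-liner with java.io.File (a deterrent against accidental edits, not a security measure):

File f = new File("data.csv");
boolean ok = f.setReadOnly(); // returns false if the operation was not permitted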
Other than writing a binary output file, NTFS (the file system Windows runs; I know for sure this works from XP through x64 Windows 7) has a little trick, alternate data streams, that you can use to hide data from anyone simply perusing your files:
Append a colon and an arbitrary value to your output and input file names, e.g. if your filename is "data.csv", use "data.csv:42" instead. Any existing or non-existing file can be written to this way to access a whole hidden area (and every stream name after the colon is distinct, so "data.csv:42" != "data.csv:carrots" != "second.csv:carrots").
If the file doesn't exist, it will be created with 0 bytes of visible data. If you open the file in Notepad you will indeed see exactly the data it held before you wrote to the :42 stream, no more, no less, yet data written to "data.csv:42" persists across reads. This makes it a perfect place to hide data from any annoying user!
Caveats: if you delete "data.csv", all associated hidden data is deleted too. Also, there are indeed programs that will find these streams, but if your user goes through all that trouble to manually edit some CSV file, I say let them.
I also have no idea whether this works on other platforms; I've never thought to try.
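A minimal sketch, assuming an NTFS volume; note that the legacy java.io streams generally accept the colon syntax, while java.nio Paths may reject it, and behavior can vary by Java version:

import java.io.FileInputStream;
import java.io.FileOutputStream;

public class HiddenStream {
    public static void main(String[] args) throws Exception {
        // Write to the alternate data stream "42" attached to data.csv (NTFS only).
        try (FileOutputStream out = new FileOutputStream("data.csv:42")) {
            out.write("hidden payload".getBytes("UTF-8"));
        }
        // Read it back; data.csv itself still shows its original contents.
        try (FileInputStream in = new FileInputStream("data.csv:42")) {
            System.out.println(new String(in.readAllBytes(), "UTF-8"));
        }
    }
}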
I need an effective algorithm to keep only the ten latest files on disk in a particular folder, to support some kind of publishing process. Only 10 files should be present in this folder at any point in time. Please give your advice on what should be used here.
You can ask the File for the directory to listFiles(); if there are more than 9, sort them by lastModified() and delete the oldest (smallest value) to trim down to 9, leaving room for the incoming file.
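A minimal sketch of that pruning step (it assumes the file's lastModified timestamp reflects when it arrived in the folder; see the caveat in the last answer below):

import java.io.File;
import java.util.Arrays;
import java.util.Comparator;

public class FolderPruner {

    // Keep at most 'keep' newest files in 'dir', deleting the oldest by lastModified().
    static void prune(File dir, int keep) {
        File[] files = dir.listFiles(File::isFile);
        if (files == null || files.length <= keep) return;
        Arrays.sort(files, Comparator.comparingLong(File::lastModified)); // oldest first
        for (int i = 0; i < files.length - keep; i++) {
            files[i].delete();
        }
    }

    public static void main(String[] args) {
        prune(new File("/var/publish"), 9); // placeholder path; trim to 9 before adding the 10th
    }
}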
How about using a file system watcher like JNotify?
Register for the events that you are interested in (for instance, the created event);
update your internal list of files upon every created event;
as soon as you reach the 11th file, remove the file having the oldest creation date (a sketch of the same idea follows below).
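If you would rather avoid a native library like JNotify, here is a minimal sketch of the same watch-and-trim idea using the JDK's own java.nio.file.WatchService (Java 7+); the folder path is a placeholder:

import java.io.File;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.Arrays;
import java.util.Comparator;

public class TenLatestWatcher {
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get("/var/publish"); // placeholder folder
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);

        while (true) {
            WatchKey key = watcher.take(); // blocks until a new file appears
            key.pollEvents(); // drain the events; we re-scan the folder anyway
            File[] files = dir.toFile().listFiles(File::isFile);
            if (files != null && files.length > 10) {
                Arrays.sort(files, Comparator.comparingLong(File::lastModified));
                for (int i = 0; i < files.length - 10; i++) {
                    files[i].delete(); // remove the oldest beyond the 10 newest
                }
            }
            key.reset();
        }
    }
}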
Or use Commons JCI's FileAlterationMonitor (FAM) to monitor local filesystems and get notified about changes:

// The listener is notified of filesystem changes under 'directory'.
ReloadingClassLoader classloader = new ReloadingClassLoader(this.getClass().getClassLoader());
ReloadingListener listener = new ReloadingListener();

listener.addReloadNotificationListener(classloader);

FilesystemAlterationMonitor fam = new FilesystemAlterationMonitor();
fam.addListener(directory, listener);
fam.start();
This discussion may help you with file system watchers.
You'd have to poll the directory at regular intervals and delete everything older than the 10th-oldest file in it.
Of course, that leaves open the question of what the "10th-oldest file" actually is. The timestamp on the file might not indicate the date/time it was added to the folder, after all.
So your system might actually need some independent way to keep track of the files in the folder and when each was added, in order to delete files based on when they were put there rather than how old each file actually is.
But that's a business requirement you don't provide (do you even know it yourself?).