I am writing a program that parses XML files holding tourist attractions for cities. Each city has its own XML file, and the nodes have info like cost, address, etc. I want to have a thread on a timer that checks a specific directory for new XML files or more recent versions of existing ones. Creating the thread is not the problem; I just have no idea what the best way to check for these new or changed files is. Does anyone have suggestions for an easy way to do that? I was thinking of creating a CSV file with the name and last-modified date of each file processed and then checking against it when I look for new or altered XML, but that seems overly complicated and I would like a better solution. I have no code to offer at this point for this mechanism; I am just looking for a direction to go in.
The idea is that as I get XML files for different cities fitting the schema, the program will update my DB automatically the next time it runs, or periodically if it is already running.
To avoid polling you should watch the directory containing the XML files. Oracle has extensive documentation on the topic: Watching a Directory for Changes.
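The tutorial's approach boils down to `java.nio.file.WatchService`. Here is a minimal sketch along those lines (class and method names are my own, not from the tutorial):

```java
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.TimeUnit;
import static java.nio.file.StandardWatchEventKinds.*;

public class XmlDirWatcher {

    /** Registers dir for create/modify events and returns the watcher. */
    public static WatchService register(Path dir) throws Exception {
        WatchService watcher = dir.getFileSystem().newWatchService();
        dir.register(watcher, ENTRY_CREATE, ENTRY_MODIFY);
        return watcher;
    }

    /** Waits up to timeoutSeconds for events; returns the .xml paths that changed. */
    public static List<Path> pollXmlChanges(WatchService watcher, Path dir,
                                            long timeoutSeconds) throws Exception {
        List<Path> changed = new ArrayList<>();
        WatchKey key = watcher.poll(timeoutSeconds, TimeUnit.SECONDS);
        if (key == null) return changed;                 // timed out, nothing happened
        for (WatchEvent<?> event : key.pollEvents()) {
            if (event.kind() == OVERFLOW) continue;      // some events were dropped
            Path p = dir.resolve((Path) event.context());
            if (p.toString().endsWith(".xml")) changed.add(p);
        }
        key.reset();                                     // re-arm the key for more events
        return changed;
    }
}
```

A timer or dedicated thread can then call `pollXmlChanges` in a loop and re-parse each reported city file.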
What you are describing looks like asynchronous feeding of new info. One common pitfall with such a problem is a race condition: what happens if you try to read a file while it is being modified, or if something else tries to write a file while you are reading it? What happens if your app (or the app that edits your XML files) breaks in the middle of processing? To avoid such problems you should move files (change their name or directory) to track their status, because moves are atomic operations on normal file systems. If you want a bulletproof solution, you should have:
files being edited or transferred by an external party
files fully edited or transferred and ready to be read by your app
files being processed
files completely processed
files containing errors (tried to process them but could not complete processing)
The first two are under external responsibility (you just define an interface contract); the last two are under yours. The cost is 4 or 5 directories (if you choose that solution); the gains are:
if there is any problem while editing/transferring an XML file, the external app just has to restart its operation
if a file can't be processed (syntax error, oversized, ...), it is set apart for further analysis but does not prevent the processing of other files
you only have to watch almost empty directories
if your app breaks in the middle of processing a file, it can restart that processing at its next start.
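The per-status moves above can be done with `Files.move` and `ATOMIC_MOVE`. A minimal sketch, with directory names (ready/processing/done/error) and class names of my own choosing:

```java
import java.nio.file.*;

public class StagedProcessor {

    /** Atomically claims a ready file by moving it into processingDir; returns its new path. */
    public static Path claim(Path file, Path processingDir) throws Exception {
        Path target = processingDir.resolve(file.getFileName());
        // ATOMIC_MOVE either fully succeeds or fails; it never leaves a half-copied file
        return Files.move(file, target, StandardCopyOption.ATOMIC_MOVE);
    }

    /** Files the result away: doneDir on success, errorDir on failure. */
    public static Path finish(Path file, boolean ok, Path doneDir, Path errorDir)
            throws Exception {
        Path target = (ok ? doneDir : errorDir).resolve(file.getFileName());
        return Files.move(file, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

If the app crashes, anything left in the processing directory at startup is simply re-processed or moved back to ready.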
Related
I want to move two files to a different directory in the same filesystem.
Concrete example, I want to move /var/bigFile to /var/path/bigFile, and /var/smallFile to /var/path/smallFile.
Currently I use Files.move(source, target) without any options, moving the small file first and the big file second. I need this order since there is another process waiting for these files to arrive, and the order is important.
The problem is that sometimes I see the creation date of the small file being later than the creation date of the big file, as if the moving order were not followed.
Initially I thought I had to do a sync, but that does not make sense:
given that the move will actually be a simple rename, there are no system buffers involved that would need to be flushed to disk.
Timestamps for the files were checked using the ls -alrt command.
Does anyone have any idea what could be wrong?
I have 50k machines, and each machine has a unique ID.
Every 10 seconds each machine sends a file to a machine_feed directory located on an FTP server. Not all files are received at the same time.
Each machine creates its file using its ID as the name.
I need to process all received files. If a file is not processed within a short time, the machine will send a new file that overrides the existing one, and I will lose the existing data.
My solution is:
I have created a Spring Boot application containing one scheduler that executes every millisecond; it renames each received file, appending the current date and time, and copies it to a processing directory.
I have one more job, written in Apache Camel, that polls the processing location every 500 milliseconds, processes each file, and inserts its data into the DB. If an error occurs, the file is moved to an error directory.
The files are not big; each contains only one line of information.
The issue is that with few files it does a great job, but as the number of files increases, valid files start being moved to the error folder.
When Camel polls a file it finds a zero-length file, yet once that file has been copied to the error directory it contains valid data. Somehow Camel is polling files that have not been completely copied yet.
Does anyone know a good solution for this problem?
Thanks in advance.
I've faced a similar problem before but I used a slightly different set of tools...
I would recommend taking a look at Apache Flume; it is a lightweight Java process. This is what I used in my situation. The documentation is pretty decent, so you should be able to find your way, but I thought I'd give a brief introduction anyway just to get you started.
Flume has 3 main components and each of these can be configured in various ways:
Source - The component responsible for sourcing the data
Channel - Buffer component
Sink - This would represent the destination where the data needs to land
There are other optional components as well such as Interceptor - which is primarily useful for intercepting the flow and carrying out basic filtering, transformations etc.
There is a wide variety of options to choose from for each of these, but if none of the available ones suits your use case, you can write your own component.
Now, for your situation, here are a couple of options I could think of:
Since your file location needs almost continuous monitoring, you might want to use Flume's Spooling Directory Source, which continuously watches your machine_feed directory and picks up each file as soon as it arrives (you could choose to alter the name yourself before the file gets overwritten).
So, the idea is to pick up the file and hand it over to the processing directory and then carry on with the processing with Apache Camel as you are already doing it.
The other option, and the one I would recommend considering, is to do everything in one Flume agent.
Your flume set-up could look like this:
Spooling Directory Source
One of the interceptors (optional: for your processing before inserting the data into the DB; if none of the available options is suitable, you could even write your own custom interceptor)
One of the channels (a memory channel, maybe)
Lastly, one of the sinks (This might just need to be a custom sink in your case for landing the data in a DB)
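For reference, a single-agent setup along those lines might be configured like this; the agent name, paths, and the custom sink class are placeholders, not something from your project:

```properties
# Hypothetical agent named a1; paths and class names are placeholders.
a1.sources  = feed
a1.channels = mem
a1.sinks    = db

# Spooling Directory Source: watches machine_feed and ingests files as they arrive
a1.sources.feed.type     = spooldir
a1.sources.feed.spoolDir = /data/machine_feed
a1.sources.feed.channels = mem

# Memory channel as the buffer between source and sink
a1.channels.mem.type     = memory
a1.channels.mem.capacity = 100000

# Custom sink that writes each event to the database (class name is an assumption)
a1.sinks.db.type    = com.example.flume.DbSink
a1.sinks.db.channel = mem
```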
If you do need to write a custom component (an interceptor or a sink), you can look at the source code of one of the default components for reference. Here's the link to the source code repository.
I understand that I've gone off on a slightly different tangent by suggesting a new tool altogether, but this worked magically for me, as Flume is a very lightweight tool with a fairly straightforward setup and configuration.
I hope this helps.
My Java application is supposed to read the logging data of a Snort application on a Debian server.
The Snort application runs independently of my evaluation app and writes its logs into a file.
My evaluation app is supposed to check just the new content every 5 minutes. That's why I will move the logfile, so that the Snort application has to create a new file while my app checks the already-written data in the old one.
Now the question: how can I ensure that I don't destroy the file if I move it at the very moment the Snort application is writing to it? Does Java have functionality to check the current actions on the file so that no data can get lost? Does the OS lock the file while it is being written?
Thanks for your help, Kn0rK3
Not exactly what you are looking for, but I would do this in a very different way: either record the line number/timestamp of the last entry read from the log file, or record the position in a RandomAccessFile (the second option is more efficient, for obvious reasons). Then, the next time you read the file, only read from the recorded position to the EOF, at which point you record the last-read position again.
Also, you can replace the "poll every 5 minutes" strategy with "poll every time I get an update notification" for this file.
Since I assume that you don't have control over the code of the Snort application, I don't think NIO FileLocks will help you.
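A minimal sketch of that RandomAccessFile approach (the class name is mine, and persisting the offset across restarts is left to you):

```java
import java.io.*;
import java.util.ArrayList;
import java.util.List;

public class LogTailer {
    private long position;   // offset of the last byte read; persist it between runs

    /** Returns every line appended to the log since the previous call. */
    public synchronized List<String> readNewLines(File log) throws IOException {
        List<String> lines = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(log, "r")) {
            if (raf.length() < position) position = 0;   // file truncated or rotated
            raf.seek(position);
            String line;
            while ((line = raf.readLine()) != null) lines.add(line);
            position = raf.getFilePointer();             // remember where we stopped
        }
        return lines;
    }
}
```

Note that readLine can return a partially written last line if the writer is mid-write; a production version would hold back the final line unless the file ends with a newline.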
It should not be an issue. Typically a logging application keeps some sort of file descriptor or stream open to a file. If the file gets renamed, that doesn't affect the writing application in any way: the name is independent of the contents of the file and its location on disk. Snort should continue to write to the renamed file until it notices the rename, at which point it reopens a new log file under the old name and switches to writing to that one.
That's the whole reason why it reopens in the first place: to support this sort of mechanism.
Now the question: How can I ensure that I don't destroy the file in the case...
The only thing you have to worry about is renaming the file to a file name that does not already exist. I would recommend moving it to a .YYYYMMDD.HHMMSS extension or something similar.
NOTE: In threaded logging operations, even if the new file has been opened, you may have to wait a bit for all of the threads to switch to the new logging stream. I'm not sure how Snort works, but I have seen the log.YYYYMMDD file growing even after the log file was reopened. I just wait a minute before consuming the renamed logfile. FYI.
I'm adding autosave functionality to a graphics application in Java. The application periodically autosaves the current document and also autosaves on exit. When the user starts the application, the autosave file is reloaded.
If the autosave file is corrupted in any way (I assume a power cut when the file is in the middle of being saved would do this?), the user will lose their work. How can I prevent such situations and do all I can to guarantee that the autosave document is in a consistent state?
To further complicate matters, to autosave the document I need to save one .xml file and several .png files. Also, the .png saving occurs in C code over JNI.
My current strategy is to write each .png with the extension .png.tmp, write the .xml file with the extension .xml.tmp, and then rename each file to remove the .tmp part, leaving the .xml until last. On startup, I only load the autosave document if I can find a .xml file, and I ignore .xml.tmp files. I also don't delete the previous autosave document until the .xml.tmp file for the new document has been renamed.
I guess my knowledge of what happens when you write to disk is poor. I know you can have software read/write buffers when using files, as well as OS and hardware buffers, and that all of these need to be flushed. I'm confused about how I can know for sure when something really has been written to disk and what I can do to protect myself. Does the renaming operation do anything to make sure buffers are flushed?
If the autosave file is corrupted in any way (I assume a power cut when the file is in the middle of being saved would do this?), the user will lose their work. How can I prevent such situations and do all I can to guarantee that the autosave document is in a consistent state?
To prevent loss of data due to a partially written autosave file, don't overwrite the autosave file. Instead, write to a new file each time, and then rename it once the file has been safely written.
To guard against not noticing that an autosave file has not been correctly written:
Pay attention to the exceptions thrown as the autosave file is written and closed, in case of a disc error, a full file system, etc.
Keep a running checksum of the file as it is written and write it at the end of the file. Then when you load the autosave file, check that the checksum is there and is correct.
If the checkpointed state involves multiple files, make sure that you write the files in a well known order (without overwriting!), and write the checksum on the autosave file after all of the other files have been safely closed. You might want to create a directory for each checkpoint.
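One way to implement the running checksum is java.util.zip.CRC32, with the value appended as a trailing line; the on-disk format here is my own invention, not a standard:

```java
import java.io.*;
import java.util.zip.CRC32;

public class ChecksummedFile {

    /** Writes data followed by its CRC32 as a trailing "\n" + 8 hex chars. */
    public static void write(File f, byte[] data) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(data);
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write(data);
            out.write(String.format("\n%08x", crc.getValue()).getBytes("US-ASCII"));
            out.getFD().sync();                      // push OS buffers to the disk
        }
    }

    /** Returns the payload if the trailing checksum matches, else null. */
    public static byte[] readVerified(File f) throws IOException {
        byte[] all = java.nio.file.Files.readAllBytes(f.toPath());
        if (all.length < 9) return null;             // too short for "\n" + 8 hex chars
        byte[] data = java.util.Arrays.copyOf(all, all.length - 9);
        String stored = new String(all, all.length - 8, 8, "US-ASCII");
        CRC32 crc = new CRC32();
        crc.update(data);
        return String.format("%08x", crc.getValue()).equals(stored) ? data : null;
    }
}
```

On load, a null result means the autosave is truncated or corrupted, and you fall back to an older copy.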
FOLLOW UP
No, I'm not saying that rename always succeeds. However, it is atomic: it either succeeds (and completes) or the file system is not changed. So, if you do this:
write "file.new" and close,
delete "file",
rename "file.new" to "file"
then, provided the first step succeeds, you are guaranteed to have the latest "file" safely on disc. And it is simple to add a couple of steps so that you have a backup of "file" at all times. (If the 3rd step fails, you are left with "file.new" and no "file". This can be recovered manually, or automatically by the application the next time you run it.)
Also, I'm not saying that writes always succeed, or that applications don't crash, or that the power never goes off. And the point of the checksum is to allow you to detect the cases where these things have happened and the autosave file is incomplete.
Finally, it is a good idea to have two autosaves in case your application gets itself into a state where its data structures are messed up and the last autosave is nonsensical as a result. (The checksum won't protect against this.) Be cautious about autosaving when the application crashes for the same reason.
As an aside, since you have several different files as part of this one document, consider using either a project directory to hold them all together, or using some encapsulation format (like .zip) to put them all inside one file.
What you want to do is atomically replace the old backup files with new ones. Unfortunately, I don't believe that Java gives you enough control to do this directly. You also need to reason about which operations are atomic in the underlying operating system. I know Linux file systems, so my answer will be biased towards a Java program running on such a system. I would be shocked if Windows didn't do the same thing, but I can't say for certain.
Most Linux file systems (e.g. the meta-data journaled ones) let you rename files atomically. If the system crashes half-way through a rename, when you restart, it will be as if you never renamed a file in the first place. For this reason, a common way to atomically update an existing file F is to write your new data to a temporary file T and then rename T to F. Any system or application crash up to that rename will not affect F, so it will always be consistent.
Of course, before you rename, you need to make sure that your temporary file is consistent. Make sure that all streaming buffers for the file are flushed to the OS (Channel.force() or OutputStream.flush()) and that the OS buffers are flushed to the disk (FileOutputStream.getFD().sync()). Of course, unless your OS disables the write cache on the hard disk itself (it probably hasn't), there's still a chance that your data can be corrupted. Add a checksum to the XML if you want to be really sure. If you're truly paranoid, you can flush the OS and hard disk buffer caches and re-read the file to verify that it is consistent. This is beyond any reasonable expectation for normal consumer applications.
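A sketch of that flush-sync-rename sequence (class and method names are mine; this assumes a POSIX-style file system where an atomic rename replaces an existing target):

```java
import java.io.*;
import java.nio.file.*;

public class SafeSave {

    /** Writes data atomically: temp file, flush, fsync, then rename over target. */
    public static void save(Path target, byte[] data) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        try (FileOutputStream out = new FileOutputStream(tmp.toFile())) {
            out.write(data);
            out.flush();                 // stream buffers -> OS
            out.getFD().sync();          // OS buffers -> disk
        }
        // on POSIX file systems this atomically replaces target
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE,
                   StandardCopyOption.REPLACE_EXISTING);
    }
}
```

Readers either see the complete old contents or the complete new contents, never a half-written file.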
But that's just how to write a single file atomically. Your problem is more complex: you have many files to update atomically. For example, say you have two files, img.png and main.xml. I'd do one of these:
The easy solution is to make a per-savefile directory. You wouldn't need to worry about renaming each individual file, and you could still atomically rename the new backup dir over the old backup dir you're replacing. That is, if your old backup is bak/img.png and bak/main.xml, write bak.tmp/img.png and bak.tmp/main.xml and rename bak.tmp to bak.
Name the new auxiliary files something else and let them coexist with the old ones for a little while. That is, write img.2.png and main.xml.tmp (which should refer to img.2.png, not img.png) and only rename main.xml.tmp to main.xml. Then delete img.png.
Addition: if you don't have atomic renames, the next best thing extends on #2. Whenever you save the project, give it a new name (e.g. ver342.xml). When you load, just find the most recent XML file that is consistent (i.e. its checksum verifies). Keep around 2 or 3 to be safe. Only delete an auto-save once you have successfully restored from a more recent copy.
I'm working on a small Java application (Java 1.6, Solaris) that will use multiple background threads to monitor a series of text files for output lines that match a particular regex pattern and then make use of those lines. I have one thread per file; they write the lines of interest into a queue and another background thread simply monitors the queue to collect all the lines of interest across the whole collection of files being monitored.
One problem I have is when one of the files I'm monitoring is reopened. Many of the applications that create the files I'm monitoring will simply restart their logfile when they are restarted; they don't append to what's already there.
I need my Java application to detect that the file has been reopened and restart following the file.
How can I best do this?
Could you keep a record of the length of each file? When the current length subsequently goes back to zero, or is smaller than the last time you recorded it, you know the file has been restarted by the app.
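That length check could look like the following sketch (the class name is mine); it reports a restart whenever a file is shorter than it was at the previous observation:

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;

public class TruncationDetector {
    private final Map<String, Long> lastLengths = new HashMap<>();

    /** Returns true if the file shrank since the previous check (i.e. it was recreated). */
    public synchronized boolean wasRestarted(File f) {
        long len = f.length();
        Long prev = lastLengths.put(f.getPath(), len);
        return prev != null && len < prev;
    }
}
```

Each per-file monitoring thread would call this before reading, and on true reset its read offset to zero.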
Using a lockfile is a solution, as Jurassic mentioned.
Another way is to check, while you are scanning for the pattern, whether the file has a new size and creation time. If the creation time is not the same as when you first found the file, then you can be sure that it has been recreated.
You could place a marker on the filesystem indicating that you are reading a given file. Suppose that next to the file being read (a.txt) you create a file (a.txt.lock) that indicates a.txt is being read. When your process is done with it, a.txt.lock is deleted. Every time a process goes to open a file to read it, it checks for the lock file beforehand. If there is no lockfile, the file is not in use. I hope that makes sense and answers your question. Cheers!
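The piece that makes the lockfile approach safe is that Files.createFile fails atomically if the file already exists, so two processes can never both think they hold the lock. A minimal sketch (names are mine):

```java
import java.io.IOException;
import java.nio.file.*;

public class LockFile {

    /** Tries to take the lock for file; returns true on success, false if already held. */
    public static boolean acquire(Path file) throws IOException {
        try {
            // createFile is atomic: it throws if the lock file already exists
            Files.createFile(file.resolveSibling(file.getFileName() + ".lock"));
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        }
    }

    /** Releases the lock taken by acquire. */
    public static void release(Path file) throws IOException {
        Files.deleteIfExists(file.resolveSibling(file.getFileName() + ".lock"));
    }
}
```

A production version would also handle stale locks left behind by a crashed process, e.g. by checking the lock file's age.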