I have a directory with files that cannot be removed, because they are used by other applications or have read-only attributes. This means I can't let Mule move or delete the files, which is its natural way of tracking what it has processed. To process these files through Mule when they arrive or when they get updated, without deleting or moving them from the original directory, I need some sort of custom tracking. For this I think I need to add some rules and be able to track files that are:
New files
Processed files
Updated files
For this, I thought of having a log file in the same directory that would track each file by name and date modified, but I'm not sure this is the correct approach. I would need to write and read this log file and compare its contents with the current files in the directory in order to determine which files are new or updated. This seems a bit too complicated and requires quite a bit of programming (maybe as Groovy scripts or by overriding some methods).
Is there any simpler way to do this in Mule? If not, how should I start tackling this problem? I'm guessing I can write some Java to talk to the File endpoint.
As Victor Romero pointed out, the Idempotent Filter does the trick. I tried two types of Idempotent Filter to see which one works best: the Idempotent Message Filter and the Idempotent Secure Hash Message Filter. Both of them did the job; however, I ended up using the Idempotent Message Filter (no hash) to log the timestamp and filename in the simple-text-file-store.
Just after the File inbound-endpoint:
<idempotent-message-filter idExpression="#[message.inboundProperties.originalFilename+'-'+message.inboundProperties.timestamp]" storePrefix="prefix" doc:name="Idempotent Message">
<simple-text-file-store name="uniqueProcessedMessages" directory="C:\yourDirectory"/>
</idempotent-message-filter>
Only files that are new or modified for the purposes of my process pass through. However, the Idempotent Secure Hash Message Filter should do a better job of telling different file contents apart.
I have 50k machines, and each machine has a unique id.
Every 10 seconds each machine sends a file to the machine_feed directory located on an FTP server. Not all files are received at the same time.
Each machine creates its file using its id as the name.
I need to process all received files. If a file is not processed quickly, the machine will send a new file that overwrites the existing one, and I will lose the existing data.
My solution is:
I have created a Spring Boot application containing one scheduler that executes every 1 millisecond; it renames each received file and copies it to a processing directory, with the current date and time appended to each file name.
I have one more job, written in Apache Camel, that polls the processing location every 500 milliseconds, processes each file, and inserts the data into the DB. If an error occurs, it moves the file to an error directory.
The files are not big. Each contains only one line of information.
The issue is that with fewer files it does a great job, but as the number of files grows, valid files start being moved to the error folder.
When Camel polls, it finds a zero-length file; yet after that file has been copied to the error directory, it contains valid data. Somehow Camel is polling files that have not been copied completely.
Does anyone know a good solution to this problem?
Thanks in advance.
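For reference, Camel's file component has built-in read-lock strategies aimed at exactly this kind of partially-written-file race. A consumer endpoint along these lines could help; the directory name below is illustrative, and this snippet is an addition of mine, not part of the original setup:

```
file:processing?readLock=changed&readLockCheckInterval=500&readLockMinAge=1000
```

With readLock=changed, the consumer watches the file's length and timestamp and only picks the file up once it has stopped changing for the configured interval, so zero-length or half-copied files are left alone.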
I've faced a similar problem before but I used a slightly different set of tools...
I would recommend taking a look at Apache Flume, a lightweight Java process; this is what I used in my situation. The documentation is pretty decent, so you should be able to find your way, but I thought I'd give a brief introduction anyway, just to get you started.
Flume has 3 main components and each of these can be configured in various ways:
Source - The component responsible for sourcing the data
Channel - Buffer component
Sink - This would represent the destination where the data needs to land
There are other optional components as well, such as the Interceptor, which is primarily useful for intercepting the flow and carrying out basic filtering, transformations, etc.
There is a wide variety of options to choose from for each of these, but if none of the available ones suits your use case, you can write your own component.
Now, for your situation, the following are a couple of options I could think of:
Since your file location needs almost continuous monitoring, you might want to use Flume's Spooling Directory Source, which would continuously watch your machine_feed directory and pick up each file as soon as it arrives (you could alter the name yourself before the file gets overwritten).
So, the idea is to pick up the file, hand it over to the processing directory, and then carry on with the processing in Apache Camel as you are already doing.
The other option, and the one I would recommend considering, is to do everything in one Flume agent.
Your flume set-up could look like this:
Spooling Directory Source
One of the interceptors (optional: for your processing before inserting the data into the DB; if none of the available options is suitable, you could write your own custom interceptor)
One of the channels (a memory channel, perhaps)
Lastly, one of the sinks (this might need to be a custom sink in your case, for landing the data in a DB)
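To make the single-agent idea concrete, a Flume configuration along these lines might look as follows. The agent and component names, the path, and the sink class are placeholders of mine, not from the original answer:

```
# Name the components of this agent
agent1.sources  = spoolSrc
agent1.channels = memCh
agent1.sinks    = dbSink

# Spooling Directory Source watching the feed directory
agent1.sources.spoolSrc.type     = spooldir
agent1.sources.spoolSrc.spoolDir = /data/machine_feed
agent1.sources.spoolSrc.channels = memCh

# In-memory buffer between source and sink
agent1.channels.memCh.type     = memory
agent1.channels.memCh.capacity = 10000

# Hypothetical custom sink class that writes events to the DB
agent1.sinks.dbSink.type    = com.example.flume.DatabaseSink
agent1.sinks.dbSink.channel = memCh
```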
If you do need to write a custom component (an interceptor or a sink), you could look at the source code of one of the default components for reference. Here's the link to the source code repository.
I understand that I've gone off on a slightly different tangent by suggesting a new tool altogether, but this worked magically for me, as Flume is a very lightweight tool with a fairly straightforward setup and configuration.
I hope this helps.
I saw the project Hive2Hive; I think it is a very good project and I am very interested in it.
I am involved with a project that has to save the files of different applications in a file system, much like this one.
See image here
Let me describe the structure of this project:
There are 2 Applications in 2 different servers.
The Agent:
is responsible for sending the content of a folder to a Collector
This content is a snippet of the file
only sends information when the Collector asks
it listens for the Collector and sends the content over the same port
after sending, it may delete the file, truncate it, or do nothing
it needs the key (.ssh/id_rsa) and, if necessary, a user
With the key, it can do a context or user switch, and is thus able to delete or truncate the file
it communicates with only one Collector
it needs the folder to search in, an include list, an exclude list, and an action list
These lists are patterns
The action list decides whether the file will be deleted or truncated.
The Collector:
saves the content in a File System
knows, which content was saved in the File System
it can connect to many Agents
This application was developed by ourselves, but it is very unreliable.
Files are deleted without our knowing
A lot of spaghetti code, and very old
Very little time to repair it, and it is very hard to repair
When I saw the hive2hive, I see many interesting features:
File Synchronization
File Versioning and Conflict Management
File Watchdog / Change Detection (automated, configurable)
Users can use multiple clients (simultaneously)
Multiple users can use the same machine (simultaneously)
I would like to run the application headless, a program that runs in the background.
I have some questions:
How can I decide which files to sync?
I could have a list of files selected by a pattern, for example *.log
How can I send from a source server to a destination server and keep Versioning?
Is it possible to use key files?
Most of the examples use user credentials. I prefer a key file.
How should I configure it?
4.1. Should I have 2 applications?
One in the source the other in the destination
4.2. Should I have only one application?
Where should it be? Source or Destination
Can I keep the same format, many Agents and few Collectors?
I am the only person who can implement this application, and because of time constraints I am asking for help.
I would like to hear your opinion and advice.
Thank you very much in advance,
Best regards,
Luís
I have a large directory containing files that are modified by a separate system at varying intervals. I am running a watcher on this directory to detect which files are modified.
I'm wondering if there is some sort of trigger that occurs when a file is accessed by the system for modification. If so, the following would apply:
Using Java, is it possible to detect which files are about to be modified and make a temporary backup before that happens?
Alternatively, is it possible to compare the newly modified file against its previous version?
In this scenario it is impossible to back up every file, as the files are large and there are many of them.
Example:
I have four files:
a.xml
b.xml
c.xml
d.log
b.xml has a new section added.
Is it possible to copy the newly created section into d.log?
I can think of one way that could be a possible solution to your problem:
Maintain a log file that tracks the lastModified date of each file; you can then verify which files have been modified by checking against your log file.
--
Jitendra
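The log-file approach described above can be sketched in plain Java using a `Properties` file as the log. The class name and file layout here are my own invention, not from the answer:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

public class ModificationLog {
    private final Path logFile;
    private final Properties lastSeen = new Properties();

    public ModificationLog(Path logFile) throws IOException {
        this.logFile = logFile;
        if (Files.exists(logFile)) {
            try (InputStream in = Files.newInputStream(logFile)) {
                lastSeen.load(in); // restore the previously recorded timestamps
            }
        }
    }

    /** True if the file is new or its lastModified changed since the last call. */
    public boolean isNewOrModified(Path file) throws IOException {
        String key = file.getFileName().toString();
        String current = Long.toString(Files.getLastModifiedTime(file).toMillis());
        String previous = lastSeen.getProperty(key);
        if (current.equals(previous)) {
            return false; // unchanged since last recorded timestamp
        }
        // Record the new timestamp and persist the log
        lastSeen.setProperty(key, current);
        try (OutputStream out = Files.newOutputStream(logFile)) {
            lastSeen.store(out, "file name -> lastModified millis");
        }
        return true;
    }
}
```

Each call records the file's current lastModified time, so a subsequent call only reports the same file again after it has actually changed on disk.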
No, you cannot detect a file that will be modified, not until someone comes up with a highly accurate future-predicting AI system.
Your best approach would be to maintain a versioned backup of the files. I would start by looking into the design considerations of source code management systems.
How would you know if the files are about to be modified? The system handles all of the file I/O. The only way you could do that is to have the program doing the modification trigger the backup and then make the modifications. For comparison, it depends on what you want: if you want a line-by-line comparison, that should be fairly simple to do using Java's file I/O classes; if you just want to check whether they are the same or not, you can use a checksum on both files.
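The checksum idea can be sketched with the JDK's `MessageDigest`. The class and method names are illustrative; for very large files you would feed the digest incrementally rather than call `readAllBytes`:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class FileCompare {
    /** Returns true when both files have identical SHA-256 digests. */
    public static boolean sameContent(Path a, Path b)
            throws IOException, NoSuchAlgorithmException {
        return Arrays.equals(digest(a), digest(b));
    }

    private static byte[] digest(Path file)
            throws IOException, NoSuchAlgorithmException {
        // Reads the whole file into memory; fine for a sketch, not for huge files
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        return md.digest(Files.readAllBytes(file));
    }
}
```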
I am writing a program that parses XML files that hold tourist attractions for cities. Each city has its own XML, and the nodes have info like cost, address, etc. I want a thread on a timer to check a specific directory for new XML files, or for more recent versions of existing ones. Creating the thread is not the problem; I just have no idea of the best way to check for these new or changed files. Does anyone have suggestions for an easy way to do that? I was thinking of creating a CSV file with the name and date-altered info for each file processed, and then checking against this CSV file when I go to look for new or altered XML, but that seems overly complicated and I would like a better solution. I have no code to offer at this point for this mechanism; I am just looking for a direction to go in.
The idea is that as I get XMLs for different cities fitting the schema, the program will update my DB automatically the next time it runs, or periodically if it is already running.
To avoid polling you should watch the directory containing the XML files. Oracle has extensive documentation on this topic in Watching a Directory for Changes.
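A small sketch of that WatchService API follows; the class name and structure are mine, not from the Oracle tutorial:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class DirWatcher implements AutoCloseable {
    private final WatchService watcher;

    public DirWatcher(Path dir) throws IOException {
        watcher = FileSystems.getDefault().newWatchService();
        // Get notified about new files and in-place modifications
        dir.register(watcher,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_MODIFY);
    }

    /** Waits up to timeoutSeconds for one batch of events; returns affected file names. */
    public List<String> poll(long timeoutSeconds) throws InterruptedException {
        List<String> changed = new ArrayList<>();
        WatchKey key = watcher.poll(timeoutSeconds, TimeUnit.SECONDS);
        if (key != null) {
            for (WatchEvent<?> event : key.pollEvents()) {
                changed.add(event.context().toString()); // relative file name
            }
            key.reset(); // re-arm the key so further events are delivered
        }
        return changed;
    }

    @Override
    public void close() throws IOException {
        watcher.close();
    }
}
```

A timer thread could call `poll` in a loop and re-parse whichever city XMLs come back, instead of maintaining a CSV of processed files.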
What you are describing looks like asynchronous feeding of new info. One common pitfall in such a problem is a race condition: what happens if you try to read a file while it is being modified, or if something else tries to write a file while you are reading it? What happens if your app (or the app that edits your XML files) breaks in the middle of processing? To avoid such problems you should move files (change their name or directory) to reflect their status, because moves are atomic operations on normal file systems. If you want a bulletproof solution, you should have:
files being edited or transferred by an external party
files fully edited or transferred, ready to be read by your app
files being processed
files completely processed
files containing errors (tried to process them but could not complete processing)
The first two are the external party's responsibility (you just define an interface contract); the last two are yours. The cost is 4 or 5 directories (if you choose that solution); the gains are:
if there is any problem while editing or transferring an XML file, the external app just has to restart its operation
if a file can't be processed (syntax error, oversized, ...), it is put aside for further analysis but does not prevent processing of the other files
you only have to watch almost-empty directories
if your app breaks in the middle of processing a file, it can restart that processing at its next start.
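The status directories described above can be driven by atomic moves with `java.nio.file.Files.move`. This sketch, with names of my own choosing, shows the key call:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class StatusMover {
    /**
     * Moves a file into the given status directory in a single atomic step,
     * so a concurrent reader never sees a half-copied file.
     * Note: ATOMIC_MOVE requires source and target on the same file system.
     */
    public static Path moveTo(Path file, Path statusDir) throws IOException {
        Files.createDirectories(statusDir);
        Path target = statusDir.resolve(file.getFileName());
        return Files.move(file, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

A file would travel inbox -> processing -> processed (or -> error), one `moveTo` call per status change.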
We are building a service to front fetching remote static files to our android app. The service will give a readout of the current md5 checksum of a file. The concept is that we retain the static file on the device until the checksum changes. When the file changes, the service will return a different checksum and this is the trigger for the device to download the file again.
I was thinking of just laying the downloaded files down in the file system with a .md5 file next to each one. When the code starts up, I'd go over all the files and build a map from file_name (known to be unique) to checksum. Then, on requests for a file, I'd check the remote service (whose response would only be checked every few minutes) and compare the result against the map.
The more I thought about this, the more I thought someone must have already done it. So before I put time into this I was wondering if there was a project out there doing this. I did some searching but could not find any.
Yes, it's built into HTTP. You can use conditional requests and cache files based on ETags, Last-Modified, etc. If you are looking for a library that implements your particular caching scheme, it's a bit unlikely that one exists. Write one and share it on GitHub :)
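To make that concrete, a conditional GET can be built with Java 11's `java.net.http` client; the URL and ETag value below are placeholders. A 304 Not Modified response means the cached copy is still valid:

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

public class ConditionalRequest {
    /** Builds a GET that only transfers the body when the ETag no longer matches. */
    public static HttpRequest build(String url, String cachedEtag) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10))
                // Server replies 304 Not Modified if the resource still has this ETag
                .header("If-None-Match", cachedEtag)
                .GET()
                .build();
    }
}
```

Sending it with `HttpClient.newHttpClient().send(request, BodyHandlers.ofByteArray())` and checking for status 304 replaces the hand-rolled .md5 sidecar scheme.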