Reading from new folders every x seconds and using their data - java

Currently I am writing an application in Java that looks through the files in a directory (call it 'topics'). Within this directory are some number of folders named after their respective topics, maybe 'dog', 'cat', etc.
I am currently using a ScheduledExecutorService to look through this directory every 30 seconds, going through each topic folder and performing some operation on the contents within the folder (we'll say some other independent piece of code is writing something to these topic folders, maybe a log file or something).
What would be the best way to examine only the new subdirectories every 30 seconds? If I start with just the topic 'dog', and somewhere within those 30 seconds the topics 'cat' and 'bird' are added, what would be the best way for me to look through only those new folders? I was thinking about comparing against a HashSet or something, but I'm not sure what the most efficient way to do this would be.
I ask because there could potentially be a great number of subdirectories being created, and it seems problematic to loop through each one with something like directory.listFiles(). Any advice?
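For what it's worth, the HashSet comparison mentioned in the question might look roughly like this (a minimal sketch; the class and method names are made up). Note that it still lists the whole directory on every scan, which is exactly the cost the answer below avoids:

import java.io.File;
import java.util.HashSet;
import java.util.Set;

public class NewTopicScanner {
    private final Set<String> seen = new HashSet<>();

    /** Lists the topic folders that have appeared since the last scan. */
    public Set<File> findNewTopics(File topicsDir) {
        Set<File> fresh = new HashSet<>();
        File[] entries = topicsDir.listFiles(File::isDirectory);
        if (entries != null) {
            for (File dir : entries) {
                if (seen.add(dir.getName())) { // add() returns false if already known
                    fresh.add(dir);
                }
            }
        }
        return fresh;
    }
}

Calling findNewTopics(new File("topics")) from the existing 30-second scheduler would return only the folders that appeared since the previous scan.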

Use Java's WatchService API.
The WatchService API is fairly low level, allowing you to customize it. You can use it as is, or you can choose to create a high-level API on top of this mechanism so that it is suited to your particular needs. - Java Docs
When the WatchService detects a file or directory change, it notifies you of which file or directory was changed, giving you the path to it. By using this API you don't have to worry about comparing two listings to see what changed, because the API provides the path of the changed entry.
Take a look at the Java tutorial "Watching a Directory for Changes" by Oracle on using the WatchService API.
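A minimal sketch of that approach, watching the 'topics' directory from the question for newly created subdirectories (the class name and error handling are illustrative only):

import java.nio.file.*;
import static java.nio.file.StandardWatchEventKinds.*;

public class TopicsWatcher {
    public static void main(String[] args) throws Exception {
        Path topics = Paths.get("topics"); // directory from the question
        WatchService watcher = FileSystems.getDefault().newWatchService();
        // Only ENTRY_CREATE is needed to catch brand-new topic folders.
        topics.register(watcher, ENTRY_CREATE);

        while (true) {
            WatchKey key = watcher.take(); // blocks until something happens
            for (WatchEvent<?> event : key.pollEvents()) {
                if (event.kind() == OVERFLOW) {
                    continue; // events may have been lost; rescan if that matters
                }
                Path created = topics.resolve((Path) event.context());
                if (Files.isDirectory(created)) {
                    System.out.println("New topic folder: " + created);
                    // process just this new folder here
                }
            }
            if (!key.reset()) {
                break; // the watched directory is no longer accessible
            }
        }
    }
}

This removes the need to diff directory listings every 30 seconds: the operating system tells you which entries were created.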

Related

Do I need to call sync on file descriptor after using Files move operation?

I want to move two files to a different directory in the same filesystem.
Concrete example: I want to move /var/bigFile to /var/path/bigFile, and /var/smallFile to /var/path/smallFile.
Currently I use Files.move(source, target) without any options, moving the small file first and the big file second. I need this order because another process is waiting for these files to arrive, and the order is important.
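For reference, a minimal sketch of the setup just described (paths from the example above; with no copy options, these moves are plain renames within the same filesystem):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class MoveInOrder {
    public static void main(String[] args) throws IOException {
        Path smallFile = Paths.get("/var/smallFile");
        Path bigFile = Paths.get("/var/bigFile");
        Path targetDir = Paths.get("/var/path");

        // Small file first, big file second, the order the downstream process expects.
        Files.move(smallFile, targetDir.resolve(smallFile.getFileName()));
        Files.move(bigFile, targetDir.resolve(bigFile.getFileName()));
    }
}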
The problem is that sometimes I see the creation date of the small file being later than the creation date of the big file, as if the move order were not followed.
Initially I thought I had to do a sync, but that does not make sense.
Given that the move will actually be a simple rename, there are no system buffers involved that would need to be flushed to disk.
The timestamps were checked using the ls -alrt command.
Does anyone have any idea what could be wrong?

How to process 50k files received over FTP every 10 seconds

I have 50k machines, and each machine has a unique ID.
Every 10 seconds each machine sends a file to a machine_feed directory located on an FTP server. Not all files are received at the same time.
Each machine creates its file with its ID as the name.
I need to process all received files. If a file is not processed in a short time, the machine will send a new file that overrides the existing one, and I will lose the existing data.
My solution is:
I have created a Spring Boot application containing a scheduler that executes every millisecond; it renames each received file (appending the current date and time) and copies it to a processing directory.
I have one more job, written in Apache Camel, that polls the processing location every 500 milliseconds, processes each file, and inserts the data into the DB. If an error occurs, it moves the file to an error directory.
The files are not big; each contains only one line of information.
The issue is that with few files everything works well, but as the number of files grows, valid files start landing in the error folder.
When Camel polls a file it finds a zero-length file, yet after the file is copied to the error directory it contains valid data. Somehow Camel is polling files that have not been copied completely.
Does anyone know a good solution for this problem?
Thanks in advance.
I've faced a similar problem before but I used a slightly different set of tools...
I would recommend taking a look at Apache Flume - it is a lightweight Java process. This is what I used in my situation. The documentation is pretty decent, so you should be able to find your way, but here is a brief introduction anyway just to get you started.
Flume has 3 main components and each of these can be configured in various ways:
Source - The component responsible for sourcing the data
Channel - Buffer component
Sink - This would represent the destination where the data needs to land
There are other optional components as well, such as the Interceptor - primarily useful for intercepting the flow and carrying out basic filtering, transformations, etc.
There is a wide variety of options to choose from for each of these, but if none of the available ones suits your use case, you can write your own component.
Now, for your situation, here are a couple of options I can think of:
Since your file location needs almost continuous monitoring, you could use Flume's Spooling Directory Source, which would continuously watch your machine_feed directory and pick up each file as soon as it arrives (you could choose to alter the name yourself before the file gets overwritten).
So, the idea is to pick up the file and hand it over to the processing directory and then carry on with the processing with Apache Camel as you are already doing it.
The other option (and this is the one I would recommend considering) is to do everything in one Flume agent.
Your Flume set-up could look like this:
Spooling Directory Source
One of the interceptors (Optional: for your processing before inserting the data into the DB. If none of the available options are suitable - you could even write your own custom interceptor)
One of the channels (a memory channel, maybe)
Lastly, one of the sinks (This might just need to be a custom sink in your case for landing the data in a DB)
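A rough sketch of what such an agent configuration might look like. The agent and component names here are made up, and the sink class is a hypothetical custom one; the spooldir source and memory channel are standard Flume components:

# Hypothetical agent named 'agent1'
agent1.sources = feedSource
agent1.channels = memChannel
agent1.sinks = dbSink

# Spooling Directory Source watching the machine_feed directory
agent1.sources.feedSource.type = spooldir
agent1.sources.feedSource.spoolDir = /path/to/machine_feed
agent1.sources.feedSource.channels = memChannel

# In-memory buffer between source and sink
agent1.channels.memChannel.type = memory
agent1.channels.memChannel.capacity = 100000

# Hypothetical custom sink that inserts each event into the DB
agent1.sinks.dbSink.type = com.example.flume.DatabaseSink
agent1.sinks.dbSink.channel = memChannel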
If you do need to write a custom component (an interceptor or a sink), you can look at the source code of one of the default components for reference. Here's the link to the source code repository.
I understand that I've gone off on a slightly different tangent by suggesting a new tool altogether, but this worked very well for me: Flume is a very lightweight tool with a fairly straightforward setup and configuration.
I hope this helps.

Detect a file that will be modified

I have a large directory containing files that are modified by a separate system at varying intervals. I am running a watcher on this directory to detect which files are modified.
I'm wondering if there is some sort of trigger that occurs when a file is accessed by the system for modification. If so, the following would apply:
Using Java, is it possible to detect which files are about to be modified and make a temporary backup before that happens?
Alternatively, is it possible to compare the newly modified file against its previous version?
In this scenario, it is impossible to make a back up of every file as the files are large and there are many of them.
Example:
I have four files:
a.xml
b.xml
c.xml
d.log
b.xml has a new section added.
Is it possible to copy the newly created section into d.log?
I can think of one way that could be a possible solution to your problem:
Maintain a log file that tracks the lastModified date of each file; you can then verify which files have been modified by consulting your log file.
--
Jitendra
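A minimal sketch of the lastModified-tracking idea above, using an in-memory map rather than a log file (persisting the map to disk works the same way; the class name is made up):

import java.io.File;
import java.util.HashMap;
import java.util.Map;

public class ModificationTracker {
    // Maps file name to the lastModified timestamp seen on the previous scan.
    private final Map<String, Long> lastSeen = new HashMap<>();

    /** Returns true if the file is new or has changed since the last scan. */
    public boolean isNewOrModified(File file) {
        long modified = file.lastModified();
        Long previous = lastSeen.put(file.getName(), modified);
        return previous == null || previous != modified;
    }
}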
No, you cannot detect a file that will be modified - not until someone comes up with a highly accurate future-predicting AI system.
Your best approach would be to maintain versioned backups of the files. I would start by looking into the design considerations of source code management systems.
How would you know if the files are about to be modified? The system handles all of the file IO. The only way you could do that is to have the program doing the modification trigger the backup and then make the modifications. As for comparison, it depends on what you want: a line-by-line comparison should be fairly simple to do using Java's file IO classes, and if you just want to check whether they are the same or not, you can use a checksum on both files.
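A minimal sketch of the checksum idea, using the standard MessageDigest API (file names follow the example above; the backup path is hypothetical):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class FileCompare {
    // Reads the whole file into memory; for very large files, stream the
    // contents through a java.security.DigestInputStream instead.
    static byte[] sha256(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        return digest.digest(Files.readAllBytes(file));
    }

    public static void main(String[] args) throws Exception {
        Path current = Paths.get("b.xml");        // file from the example
        Path backup = Paths.get("backup/b.xml");  // hypothetical earlier copy
        boolean same = Arrays.equals(sha256(current), sha256(backup));
        System.out.println(same ? "unchanged" : "modified");
    }
}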

Mule: How to track non-deletable & non-movable files

I have a directory with files that cannot be removed because they are used by other applications or have read-only properties. This means that I can't move or delete the files the way Mule does as its natural file-tracking mechanism. In order to process these files through Mule once they arrive, or when they get updated, without deleting or moving them from the original directory, I need some sort of custom tracking. To do this I think I need to add some rules and be able to track files that are:
New files
Processed files
Updated files
For this, I thought of having a log file in the same directory that would track each file by name and date modified, but I'm not sure if this is the correct way of doing this. I would need to be able to write and read this log file and compare its content with current files in the directory in order to determine which files are new or updated. This seems to be a bit too complicated and requires me to add quite a bit of programming (maybe as groovy scripts or overriding some methods).
Is there any other, simpler way to do this in Mule? If not, how should I start tackling this problem? I'm guessing I can write some Java to talk to the File endpoint.
As Victor Romero pointed out, an Idempotent Filter does the trick. I tried two types of Idempotent Filter to see which one works best: the Idempotent Message Filter and the Idempotent Secure Hash Message Filter. Both of them did the job; however, I ended up using the Idempotent Message Filter (no hash) to log the timestamp and filename in the simple-text-file-store.
Just after the File inbound-endpoint:
<idempotent-message-filter idExpression="#[message.inboundProperties.originalFilename+'-'+message.inboundProperties.timestamp]" storePrefix="prefix" doc:name="Idempotent Message">
    <simple-text-file-store name="uniqueProcessedMessages" directory="C:\yourDirectory"/>
</idempotent-message-filter>
Only files that are new or modified for the purposes of my process pass through. However, the Idempotent Secure Hash Message Filter should do a better job of identifying different files.

Best way to check for new XML files in Java

I am writing a program that parses XML files that hold tourist attractions for cities. Each city has its own XML file, and the nodes hold info like cost, address, etc. I want to have a thread on a timer that checks a specific directory for new XML files, or for more recent versions of existing ones. Creating the thread is not the problem; I just have no idea what the best way to check for these new or changed files is. Does anyone have any suggestions for an easy way to do that? I was thinking of creating a CSV file with the name and date-altered info for each file processed, and then checking against this CSV file when I go looking for new or altered XML, but that seems overly complicated and I would like a better solution. I have no code to offer for this mechanism at this point; I am just looking for a direction to go in.
The idea is that as I get XML files for different cities that fit the schema, the program will update my DB automatically the next time it runs, or periodically if it is already running.
To avoid polling you should watch the directory containing the XML files. Oracle has extensive documentation about the topic at Watching a Directory for Changes.
What you are describing looks like asynchronous feeding of new information. One common pitfall with such problems is a race condition: what happens if you try to read a file while it is being modified, or if something else tries to write a file while you are reading it? What happens if your app (or the app that edits your XML files) breaks in the middle of processing? To avoid such problems you should move files (changing their name or directory) to reflect their status, because moves are atomic operations on normal file systems. If you want a bulletproof solution, you should have:
files being edited or transferred by an external party
files fully edited or transferred, ready to be read by your app
files being processed
files completely processed
files containing errors (tried to process them but could not complete processing)
The first two are the external party's responsibility (you just define an interface contract); the latter are yours. The cost is 4 or 5 directories (if you choose that solution); the gains are:
if there is any problem while editing or transferring an XML file, the external app just has to restart its operation
if a file can't be processed (syntax error, oversized, ...) it is set aside for further analysis but does not prevent the processing of other files
you only have to watch almost empty directories
if your app breaks in the middle of processing a file, it can resume that processing at the next start.
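A minimal sketch of the move-between-status-directories idea (directory names are illustrative; ATOMIC_MOVE makes each transition all-or-nothing on filesystems that support it):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class StatusDirectories {
    static final Path PROCESSING = Paths.get("processing"); // files being processed
    static final Path DONE = Paths.get("done");             // completely processed
    static final Path ERRORS = Paths.get("errors");         // could not be processed

    // Takes a file from the ready directory and walks it through the states.
    static void process(Path readyFile) throws IOException {
        Path inProgress = PROCESSING.resolve(readyFile.getFileName());
        Files.move(readyFile, inProgress, StandardCopyOption.ATOMIC_MOVE);
        try {
            // ... parse the XML file here ...
            Files.move(inProgress, DONE.resolve(readyFile.getFileName()),
                    StandardCopyOption.ATOMIC_MOVE);
        } catch (Exception e) {
            Files.move(inProgress, ERRORS.resolve(readyFile.getFileName()),
                    StandardCopyOption.ATOMIC_MOVE);
        }
    }
}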
