I'm working on a scenario with two pipelines using Dataflow on Google Cloud:
Pipeline A runs in streaming mode, continuously creating files in Google Cloud Storage based on hourly windows and some sharding, like this:
data.apply(TextIO.write()
    .to(resource.getCurrentDirectory())              // output location in GCS
    .withFilenamePolicy(new PerWindowFiles(prefix))  // file names share a common prefix per window
    .withWindowedWrites()                            // emit one set of files per window/pane
    .withNumShards(42));
Pipeline B runs in batch mode, regularly loading those files for further processing, e.g. every hour.
Here's the problem: which files can pipeline B safely load from GCS?
All of them -> probably not a good idea, since A may not be done writing some of them and we'd end up with incomplete files.
Based on time (e.g. only load files that are at least 2h old) -> will also cause issues if A is running late.
Some way of creating "done" flags in A that tell B which files are finished.
Somehow getting notified when a window's final pane is done processing -> I haven't found a way to do that.
I would like the third approach, but couldn't find a way to determine when TextIO is actually done writing a file without waiting for the pipeline to finish.
The writer of TextIO does not return another PCollection. One option would be to override the finalize method of the FileBasedSink.WriteOperation that is created somewhere inside TextIO, but that requires copying the whole class and eventually building a custom sink. This is overkill in my opinion.
Does anyone have ideas for an easier solution, or experience with how to achieve this?
TextIO.write() will write data to temporary files, and then atomically rename each successfully written temporary file to its final location. You can safely use the files matching your "prefix" in pipeline B, because temporary files will be named in a way that does not match the prefix (we explicitly accounted for your use case when deciding how to name temporary files), so all files seen by pipeline B will be complete.
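For example, pipeline B can simply read everything matching the final file name pattern. A minimal sketch, assuming the output lives under gs://my-bucket/output/ and the final names start with the given prefix (both are placeholders; the exact pattern depends on your PerWindowFiles policy):
// Pipeline B (batch): reads only finalized files; temporary files do not match this pattern.
Pipeline p = Pipeline.create(options);
PCollection<String> lines =
    p.apply(TextIO.read().from("gs://my-bucket/output/" + prefix + "*"));
// ... further processing ...
p.run().waitUntilFinish();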
Alternatively, we're about to add (link to pull request) a version of TextIO.read() that continuously ingests new files in streaming mode; when that's ready, you can use it in your pipeline B. See also http://s.apache.org/textio-sdf and the linked JIRAs.
Related
I have 50k machines, and each machine has a unique ID.
Every 10 seconds each machine sends a file to a machine_feed directory located on an FTP server. Not all files are received at the same time.
Each machine creates its file using its ID as the file name.
I need to process all received files. If a file is not processed quickly, the machine will send a new file that overwrites the existing one, and I will lose the existing data.
My solution is:
I have created a Spring Boot application containing a scheduler that executes every 1 millisecond; it renames each received file and copies it to a processing directory, appending the current date and time to the file name.
I have another job, written in Apache Camel, that polls the processing location every 500 milliseconds, processes each file and inserts the data into the DB. If an error occurs, it moves the file to an error directory.
The files are not big; each contains only one line of information.
The issue is that with few files this works great. As the number of files increases, files end up in the error folder even though they are valid.
When Camel polls a file it finds a zero-length file, yet once that file has been moved to the error directory it contains valid data. Somehow Camel is polling files that have not been copied completely.
Does anyone know a good solution for this problem?
Thanks in advance.
I've faced a similar problem before but I used a slightly different set of tools...
I would recommend taking a look at Apache Flume - it is a lightweight Java process. This is what I used in my situation. The documentation is pretty decent, so you should be able to find your way, but I thought I'd give a brief introduction anyway just to get you started.
Flume has 3 main components and each of these can be configured in various ways:
Source - The component responsible for sourcing the data
Channel - Buffer component
Sink - This would represent the destination where the data needs to land
There are other optional components as well such as Interceptor - which is primarily useful for intercepting the flow and carrying out basic filtering, transformations etc.
There is a wide variety of options to choose from for each of these, but if none of the available ones suit your use case, you can write your own component.
Now, for your situation, the following are a couple of options I could think of:
Since your file location needs almost continuous monitoring, you might want to use Flume's Spooling Directory Source, which would continuously watch your machine_feed directory and pick each file up as soon as it arrives (you could choose to alter the name yourself before the file gets overwritten).
So the idea is to pick up the file, hand it over to the processing directory, and then carry on with the processing in Apache Camel as you are already doing.
The other option (and this is the one I would recommend considering) would be to do everything in one Flume agent.
Your Flume set-up could look like this:
Spooling Directory Source
One of the interceptors (Optional: for your processing before inserting the data into the DB. If none of the available options are suitable - you could even write your own custom interceptor)
One of the channels (a memory channel, maybe...)
Lastly, one of the sinks (This might just need to be a custom sink in your case for landing the data in a DB)
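A minimal agent configuration along those lines might look like the following sketch; the spool directory path and the custom sink class name are assumptions made purely to illustrate the wiring:
agent.sources = feed-src
agent.channels = mem-ch
agent.sinks = db-sink

# Spooling Directory Source: watches machine_feed and ingests each new file
agent.sources.feed-src.type = spooldir
agent.sources.feed-src.spoolDir = /data/machine_feed
agent.sources.feed-src.channels = mem-ch

# Memory channel as the buffer between source and sink
agent.channels.mem-ch.type = memory
agent.channels.mem-ch.capacity = 10000

# Custom sink (hypothetical class) that inserts each event into the DB
agent.sinks.db-sink.type = com.example.flume.DbInsertSink
agent.sinks.db-sink.channel = mem-ch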
If you do need to write a custom component (an interceptor or a sink), you could just look at the source code of one of the default components for reference. Here's the link to the source code repository.
I understand that I've gone off on a slightly different tangent by suggesting a new tool altogether, but this worked magically for me, as Flume is a very lightweight tool with a fairly straightforward set-up and configuration.
I hope this helps.
I'm planning on using Amazon S3 to store millions of relatively small files (~100 kB-2 MB). To save on upload time I structured them into directories (tens/hundreds of files per directory) and decided to use TransferManager's uploadDirectory/uploadFileList. However, after an individual file has been uploaded I need to perform specific operations on my HDD and DB. Is there any way (preferably by implementing observers/listeners) to be notified whenever a specific file has finished uploading, or am I cursed with only being able to verify whether the entire MultipleFileUpload succeeded?
For whatever it's worth I'm using the Java SDK, however I should be able to adapt a .NET/REST solution to my needs.
Realizing that this isn't exactly what you asked, it's pretty sweet and seems like an appropriate solution...
S3 does have notifications you can configure to alert you when an object has been created or deleted (or if a reduced redundancy object was lost). These can go to SQS, SNS, or Lambda (which could potentially even run the code that updates the database), and of course if you send them to SNS you can then fan them out to multiple destinations.
http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html#notification-how-to-event-types-and-destinations
Don't make the mistake, however, of selecting only the upload subtype you assume is being used; use the s3:ObjectCreated:* event unless you have a specific reason not to.
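If you'd rather set the notification up from code than from the console, here is a minimal sketch using the Java SDK; the bucket name and SQS queue ARN are placeholders, and the queue's policy must already allow S3 to publish to it:
import java.util.EnumSet;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.BucketNotificationConfiguration;
import com.amazonaws.services.s3.model.QueueConfiguration;
import com.amazonaws.services.s3.model.S3Event;

public class ConfigureUploadNotifications {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Fire on every "object created" variant (PUT, POST, multipart, copy),
        // rather than guessing which upload subtype TransferManager uses.
        QueueConfiguration queueCfg = new QueueConfiguration(
                "arn:aws:sqs:us-east-1:123456789012:uploaded-files", // placeholder queue ARN
                EnumSet.of(S3Event.ObjectCreated));

        BucketNotificationConfiguration cfg = new BucketNotificationConfiguration();
        cfg.addConfiguration("object-created-events", queueCfg);

        s3.setBucketNotificationConfiguration("my-upload-bucket", cfg); // placeholder bucket
    }
}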
I have a directory with files that cannot be removed because they are used by other applications or have read-only properties. This means that I can't move or delete the files the way Mule does for its natural file-tracking mechanism. In order to process these files through Mule when they arrive or when they get updated, without deleting/moving them from the original directory, I need some sort of custom tracking. To do this I think I need to add some rules and be able to track files that are:
New files
Processed files
Updated files
For this, I thought of having a log file in the same directory that would track each file by name and date modified, but I'm not sure this is the right way to go. I would need to write and read this log file and compare its content with the current files in the directory in order to determine which files are new or updated. This seems a bit too complicated and requires quite a bit of programming (maybe as Groovy scripts or by overriding some methods).
Is there any other, simpler way to do this in Mule? If not, how should I start tackling this problem? I'm guessing I could write some Java to talk to the File endpoint.
As Victor Romero pointed out, an Idempotent Filter does the trick. I tried two types of Idempotent Filter to see which one works best: Idempotent Message Filter and Idempotent Secure Hash Message Filter. Both of them did the job; however, I ended up using the Idempotent Message Filter (no hash) to log the timestamp and filename in the simple-text-file-store.
Just after the File inbound-endpoint:
<idempotent-message-filter idExpression="#[message.inboundProperties.originalFilename+'-'+message.inboundProperties.timestamp]" storePrefix="prefix" doc:name="Idempotent Message">
<simple-text-file-store name="uniqueProcessedMessages" directory="C:\yourDirectory"/>
</idempotent-message-filter>
Only new or modified files for the purposes of my process would pass through. However Idempotent Secure Hash Message Filter should do a better job at identifying different files.
I am writing a program that parses XML files that hold tourist attractions for cities. Each city has its own XML file, and the nodes contain info like cost, address, etc. I want to have a thread on a timer that checks a specific directory for new XML files or more recent versions of existing ones. Creating the thread is not the problem; I just have no idea what the best way to check for these new or changed files is. Does anyone have suggestions for an easy way to do that? I was thinking of creating a CSV file with the name and date-modified info for each file processed, and then checking against this CSV file when I go to look for new or altered XML, but that seems overly complicated and I would like a better solution. I have no code to offer at this point for this mechanism; I am just looking for a direction to go in.
The idea is that as I get XML files for different cities fitting the schema, the program will update my DB automatically the next time it runs, or periodically if it is already running.
To avoid polling you should watch the directory containing the XML files. Oracle has extensive documentation on the topic: Watching a Directory for Changes.
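Here is a minimal sketch of that approach using java.nio.file.WatchService; the directory name and the update step are placeholders:
import java.nio.file.*;

public class XmlDirWatcher {
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get("city-xml"); // hypothetical directory holding the city XML files
        try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
            dir.register(watcher,
                    StandardWatchEventKinds.ENTRY_CREATE,
                    StandardWatchEventKinds.ENTRY_MODIFY);
            while (true) {
                WatchKey key = watcher.take();              // blocks until something changes
                for (WatchEvent<?> event : key.pollEvents()) {
                    Path changed = dir.resolve((Path) event.context());
                    if (changed.toString().endsWith(".xml")) {
                        // parse the new/updated city XML and update the DB here
                        System.out.println(event.kind() + ": " + changed);
                    }
                }
                if (!key.reset()) {
                    break;                                  // directory no longer accessible
                }
            }
        }
    }
}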
What you are describing looks like asynchronous feeding of new info. One common pitfall with this kind of problem is race conditions: what happens if you try to read a file while it's being modified, or if something else tries to write a file while you are reading it? What happens if your app (or the app that edits your XML files) breaks in the middle of processing? To avoid such problems you should move files (change their name or directory) to reflect their status, because moves are atomic operations on normal file systems. If you want a bulletproof solution, you should have:
files being edited or transferred by an external party
files fully edited or transferred and ready to be read by your app
files being processed
files completely processed
files containing errors (tried to process them but could not complete processing)
The first two are the external party's responsibility (you just define an interface contract); the latter two are yours. The cost is 4 or 5 directories (if you choose that solution); the gains are:
if there is any problem while editing or transferring an XML file, the external app just has to restart its operation
if a file can't be processed (syntax error, oversized, ...), it is set aside for further analysis but does not prevent the processing of other files
you only have to watch almost empty directories
if your app breaks in the middle of processing a file, at next start it can restart its processing.
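A minimal sketch of this status-by-directory idea in Java, assuming an incoming/processing/done/error layout where all directories live on the same file system (so renames stay atomic):
import java.io.IOException;
import java.nio.file.*;

public class StatusDirectories {
    // Hypothetical layout: files land in "incoming", move through "processing",
    // and end up in "done" or "error".
    static final Path INCOMING = Paths.get("incoming");
    static final Path PROCESSING = Paths.get("processing");
    static final Path DONE = Paths.get("done");
    static final Path ERROR = Paths.get("error");

    static void processAll() throws IOException {
        try (DirectoryStream<Path> ready = Files.newDirectoryStream(INCOMING, "*.xml")) {
            for (Path file : ready) {
                // Claim the file with an atomic rename; on the same file system
                // this either fully succeeds or fully fails.
                Path claimed = Files.move(file, PROCESSING.resolve(file.getFileName()),
                        StandardCopyOption.ATOMIC_MOVE);
                try {
                    parse(claimed); // your XML parsing / DB update goes here
                    Files.move(claimed, DONE.resolve(claimed.getFileName()),
                            StandardCopyOption.ATOMIC_MOVE);
                } catch (Exception e) {
                    Files.move(claimed, ERROR.resolve(claimed.getFileName()),
                            StandardCopyOption.ATOMIC_MOVE);
                }
            }
        }
    }

    static void parse(Path file) { /* placeholder */ }
}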
I have a file scanner application in Java that keeps scanning a directory on a server using FTP. It gets the list of files in the directory and downloads them one by one. On the other side, on the server, there's a process that writes these files. If I'm lucky I won't try to download an incomplete file, but how can I make sure the write process on the server is complete, the file handle is closed, and the file is ready to be downloaded?
I have no control over the write process on the server. Moreover, I don't have write permission on the directory, so trying to obtain a write handle in order to check whether one is already open is off the table.
Is there an FTP function addressing this problem?
This is a very old and well-known problem.
There is no way to be absolutely certain a file being written by the FTP daemon is complete. It's even possible that the file transfer failed and then gets restarted and completed. You must poll the file's size and set a time limit, say 5 minutes. If the size does not change during that time you assume the file is complete.
If possible, the program that processes the file should be able to deal with partial files.
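A rough sketch of that size-polling idea with Apache Commons Net; the quiet window and poll interval are assumptions, and the FTPClient is assumed to be already connected and logged in:
import java.io.IOException;
import org.apache.commons.net.ftp.FTPClient;
import org.apache.commons.net.ftp.FTPFile;

public class StableSizeCheck {

    // Returns true once the remote file's size has stopped changing for quietMillis;
    // returns false if the file never showed up.
    static boolean waitUntilStable(FTPClient ftp, String remotePath, long quietMillis)
            throws IOException, InterruptedException {
        long lastSize = -1;
        long lastChange = System.currentTimeMillis();
        while (System.currentTimeMillis() - lastChange < quietMillis) {
            FTPFile[] listing = ftp.listFiles(remotePath);
            long size = (listing.length == 1) ? listing[0].getSize() : -1;
            if (size != lastSize) {          // still growing (or not yet visible)
                lastSize = size;
                lastChange = System.currentTimeMillis();
            }
            Thread.sleep(5_000);             // poll every 5 seconds
        }
        return lastSize >= 0;
    }
}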
A much better alternative is rsync, which is much more robust and deterministic. It can even be configured (via command-line option) to write the data initially to a temporary location and move it to its final destination path upon successful completion. If the file exists where you expect it, then it is by definition complete.
A possible solution would be first uploading the file with a different filename (e.g. adding ".partial") and then renaming it to its final name.
If the server finds the final name then the upload has been completed.
If you cannot control the upload process then what you are asking is impossible by definition: the file upload could stop because of a network problem or because the sending process is stopped for whatever reason.
What the receiving end will observe is just a closing of the incoming stream; there is no way to guarantee that the data will not be a partial transfer.
Other workarounds could be checking for an end-of-data marker or using a request to the sending server to check if (in their view) the transfer has been completed.
This is more fundamental than FTP: you'd have a similar problem reading those files even if they were being created on the local machine.
If you can't modify the writing process, you'll need to jump through some hoops. None are great, but some are safer than others.
Keep reading until nothing changes for some window (maybe a minute, like David Schwartz suggests). You could optimize this a bit by watching the file size.
Figure out if the files are written serially in a reliable order. When you see file N appear, you know that file N-1 is ready. (Assumes that the directory is empty before the files are written, though you could also look at timestamps.) The downside is that your logic will break if the writer ever changes order or starts writing in parallel.
The reliable, safe solutions require improving the writer process.
Writer can write the files to hidden or temporary locations and only make them visible once the entire file (or directory) is ready, using symlinks or file-moving or chmod.
Writer creates a special file (e.g., "./DONE") only after all other files have been written, and reader doesn't read any files until that file is present.
Depending on the file type, the writer could add some kind of end-of-file record/line at the end of the file, and the reader could ensure that it's present.
You can use the FTP library from the Apache Commons Net API; see the FTPClient documentation for more information.
boolean completed = ftpClient.retrieveFile(remote, localOutputStream);
The returned flag tells you whether retrieval of the current file completed successfully.
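A small, self-contained sketch of how that call is typically used; the host, credentials and paths are placeholders:
import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.commons.net.ftp.FTP;
import org.apache.commons.net.ftp.FTPClient;

public class FtpDownload {
    public static void main(String[] args) throws Exception {
        FTPClient ftp = new FTPClient();
        ftp.connect("ftp.example.com");            // placeholder host
        ftp.login("user", "password");             // placeholder credentials
        ftp.enterLocalPassiveMode();
        ftp.setFileType(FTP.BINARY_FILE_TYPE);

        try (OutputStream out = new FileOutputStream("local-copy.dat")) {
            boolean completed = ftp.retrieveFile("/incoming/remote-file.dat", out);
            System.out.println(completed ? "download completed" : "download failed or incomplete");
        } finally {
            ftp.logout();
            ftp.disconnect();
        }
    }
}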