I want to understand how flume-ng handles file name collisions in the following situation.
Assume I have several identically configured Flume agents and a client uses them as a load-balancing group.
a1.sinks.k1.hdfs.path = /flume/events/path
How will the Flume agents generate filenames so that they are unique across agents? Does Flume append the agent name somehow (the names look like numbers, so it is hard to figure this out)?
Flume does not solve this problem automatically. By default the HDFS sink creates a new file with a name equal to the current timestamp (in milliseconds), so a collision may occur if two files are created at the same moment.
One way to fix this is to manually set different file prefixes in different sinks:
a1.sinks.k1.hdfs.filePrefix = agentX
You can also use event headers in the prefix definition. For example, if you use the host interceptor, which adds a "host" header with the agent's hostname to each event, you can do something like this:
a1.sinks.k1.hdfs.filePrefix = ${host}
If you need to generate unique filenames completely automatically, you can develop your own interceptor that adds a UUID header to each event.
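A minimal sketch of such an interceptor, assuming a custom class (the package, class name and "uuid" header name are made up for illustration, not a Flume built-in): it stamps every event with a random UUID, which the sink's file prefix can then reference in the same way as the host header above.

package com.example.flume;

import java.util.List;
import java.util.UUID;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class UuidInterceptor implements Interceptor {

    @Override
    public void initialize() {
        // nothing to set up
    }

    @Override
    public Event intercept(Event event) {
        // add a "uuid" header so filenames built from it are unique across agents
        event.getHeaders().put("uuid", UUID.randomUUID().toString());
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            intercept(event);
        }
        return events;
    }

    @Override
    public void close() {
        // nothing to clean up
    }

    public static class Builder implements Interceptor.Builder {

        @Override
        public Interceptor build() {
            return new UuidInterceptor();
        }

        @Override
        public void configure(Context context) {
            // no configuration needed
        }
    }
}

The interceptor would be attached to the source in the agent configuration (via the interceptors properties) so that every event already carries the header when it reaches the HDFS sink.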
Goal
I have the task of finding duplicate entries within import files, and in a later stage duplicate entries in these import files compared against a global database. The data inside the files is personal information like name, email, address, etc. The data is not always complete, and is often spelled incorrectly.
The files will be uploaded by external users through a web form. The user needs to be notified when the process is done, and he / she has to be able to download the results.
In addition to solving this task, I need to assess the suitability of Apache Beam for it.
Possible solution
I was thinking about the following: The import files are uploaded to S3, and the pipeline will either get the file location as a pub-sub event (Kafka queue), or watch S3 (if possible) for incoming files.
Then the file is read by one PTransform and each line is pushed into a PCollection. As a side output I would update a search index (inside Redis or some such). The next transform would access the search index and try to find matches. The end results (unique values, duplicate values) are written to an output file on S3, and the index is cleared for the next import.
Questions
Does this approach make sense - is it idiomatic for Beam?
Would Beam be suitable for this processing?
Any improvement suggestions for the above?
I would need to track the file name / ID to notify the user at the end. How can I move this metadata through the pipeline? Do I need to create an "envelope" object for metadata and payload, and use this object in my PCollection?
The incoming files are unbounded, but the file contents themselves are bounded. Is there a way to find out the end of the file processing in an idiomatic way?
Does this approach make sense - is it idiomatic for Beam?
This is a subjective question. In general, I would say no, this is not idiomatic for Apache Beam. Apache Beam is a framework for defining ETL pipelines. The Beam programming model has no opinions or built-in functionality for deduplicating data. Deduplication is achieved either through implementation (business-logic code you write) or through a feature of a data store (a UNIQUE constraint, SELECT DISTINCT in SQL, or key/value storage).
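For illustration only, a sketch of how deduplication could be expressed as ordinary pipeline code with the Beam Java SDK (the normalize method is a toy placeholder for your real matching logic, and the inline test data stands in for the parsed file contents):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class DedupSketch {

    // stand-in for real normalization of name/email/address
    static String normalize(String line) {
        return line.trim().toLowerCase();
    }

    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        PCollection<String> records = p.apply(Create.of(
                "Jane Doe,jane@example.com",
                "jane doe,JANE@example.com",   // duplicate after normalization
                "John Smith,john@example.com"));

        // key each record by its normalized identity, group, keep one per key
        records
                .apply(MapElements
                        .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
                        .via((String line) -> KV.of(normalize(line), line)))
                .apply(GroupByKey.create())
                .apply(ParDo.of(new DoFn<KV<String, Iterable<String>>, String>() {
                    @ProcessElement
                    public void processElement(ProcessContext c) {
                        // first value wins; anything beyond it is a duplicate that
                        // could be routed to a second output instead of dropped
                        c.output(c.element().getValue().iterator().next());
                    }
                }));

        p.run().waitUntilFinish();
    }
}

This is still "business logic you write" rather than a Beam feature; the hard part of the task is the normalization / fuzzy matching, not the grouping.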
Would Beam be suitable for this processing?
Yes, Beam is suitable.
Any improvement suggestions for the above?
I do not recommend writing to a search index in the middle of the pipeline. By doing this and then attempting to read the data back in the following transform, you've effectively created a cycle in the DAG. The pipeline may suffer from race conditions. It is less complex to have two separate pipelines - one to write to the search index (deduplicate) and a second one to write back to S3.
I would need to track the file name / ID to notify the user at the end. How can I move this metadata through the pipeline? Do I need to create an "envelope" object for metadata and payload, and use this object in my PCollection?
Yes, this is one approach. I believe you can get the file metadata via the ReadableFile class.
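A rough sketch of what that could look like with FileIO in the Java SDK (the filepattern is a placeholder, and reading the whole file into memory is only reasonable for small files):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.FileIO.ReadableFile;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;

public class FileNameExample {

    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply(FileIO.match().filepattern("s3://my-bucket/imports/*.csv"))
         .apply(FileIO.readMatches())
         .apply(ParDo.of(new DoFn<ReadableFile, KV<String, String>>() {
             @ProcessElement
             public void processElement(ProcessContext c) throws Exception {
                 ReadableFile file = c.element();
                 // the file name travels alongside the payload as an "envelope"
                 String fileName = file.getMetadata().resourceId().toString();
                 c.output(KV.of(fileName, file.readFullyAsUTF8String()));
             }
         }));

        p.run().waitUntilFinish();
    }
}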
The incoming files are unbounded, but the file contents themselves are bounded. Is there a way to find out the end of the file processing in an idiomatic way?
I'm not sure off the top of my head, but I don't think this is possible for a pipeline executing in streaming mode.
I am trying to retrieve files from only certain subdirectories on an FTP server. For example, I want to poll for files under only subdirectories A and C.
ROOT_DIR/A/test1.xml
ROOT_DIR/B/test2.xml
ROOT_DIR/C/test3.xml
ROOT_DIR/..(there are hundreds of subdirs)
I'm trying to avoid having an endpoint for each directory since many more directories to consume from may be added in the future.
I have successfully implemented this using a single SFTP endpoint on ROOT_DIR with recursive=true in conjunction with an AntPathMatcherGenericFileFilter instance as suggested.
The issue I'm having is that every subdirectory is being searched (hundreds of them), and my filter is also looking only for certain filenames. The results are only filtered after every directory has been searched, and this is taking far too long (minutes).
Is there any way to only consume from certain subdirectories that could be maintained in a properties file without searching every subdirectory?
I did find one possible solution using a different approach using a Timer Based Polling Consumer with the Ant filter. With this solution, a dynamic sftp endpoint (or list of dynamic sftp endpoints) can be consumed from using a ConsumerTemplate inside a bean instead of doing so within a route.
while (true) {
    // receive the message from the queue, wait at most 3 sec
    String msg = consumer.receiveBody("activemq:queue.inbox", 3000, String.class);
    // ... process the received body ...
}
This can then be used with my existing Ant filter to select only certain files under a dynamic list of subdirectories to consume from.
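A hedged sketch of that approach (this is not the original code; the host, credentials and Ant pattern are invented for illustration): the list of subdirectories to poll is kept in configuration, and a ConsumerTemplate builds one dynamic sftp endpoint per entry instead of recursing over ROOT_DIR.

import java.util.Arrays;
import java.util.List;

import org.apache.camel.CamelContext;
import org.apache.camel.ConsumerTemplate;
import org.apache.camel.impl.DefaultCamelContext;

public class SelectiveSftpPoller {

    public static void main(String[] args) throws Exception {
        CamelContext context = new DefaultCamelContext();
        context.start();
        ConsumerTemplate consumer = context.createConsumerTemplate();

        // in practice this list would be loaded from a properties file so new
        // subdirectories can be added without touching the code
        List<String> subDirs = Arrays.asList("A", "C");

        for (String dir : subDirs) {
            String uri = "sftp://user@ftp.example.com/ROOT_DIR/" + dir
                    + "?password=secret&antInclude=test*.xml";
            // wait at most 3 seconds for a file from this subdirectory
            String body = consumer.receiveBody(uri, 3000, String.class);
            if (body != null) {
                // process the file contents here
                System.out.println("Received " + body.length() + " chars from " + dir);
            }
        }

        consumer.stop();
        context.stop();
    }
}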
In my project, I am trying to add some metadata to the data processed in my pipeline. The metadata is located in a DBF file in a subfolder called resources next to the src folder.
The src folder contains the main class, and I have several packages (IO, processing, aggregation, utils).
I read and process the file with metadata in my main class where the pipeline is defined. The code I am using to access the file is as follows:
File temp1 = new File("resources/xxx.dbf");
I check if the file was found using:
LOG.info(temp1.exists())
which runs fine.
There are messages coming in as Strings which I read using PubSubIO. I use the contents of this file to fill a Map containing keys and values.
Map<String, ArrayList<Double>> sensorToCoordinates = coordinateData.getSensorLocations();
I then set a static variable in a custom class called 'SensorValues' I made:
SensorValue.setKeyToCoordinates(sensorToCoordinates);
When parsing the incoming messages from Strings to the SensorValue class I made, using a ParDo function (going from a PCollection<String> to a PCollection<SensorValue>), the map is used in the constructor of the SensorValue class.
Running this code using a DirectPipelineRunner works perfectly. However, when I use a DataflowPipelineRunner and I try to access the map in the SensorValue constructor, I run into a NullPointerException.
Now I am wondering why the setter is not working when using a DataflowPipelineRunner (I'm guessing it has something to do with the execution being distributed among several workers) and what the best practice would be for using static resource files to enrich a pipeline.
You're right that the problem is because the execution of the ParDo is distributed to multiple workers. They don't have the local file, and they may not have the contents of the map.
There are a few options here:
Put the file in GCS, and have the pipeline read the contents of the file (using TextIO or something like that) and use it as a side-input to your later processing.
Include the file in the resources for the pipeline and load that in the startBundle of the DoFn that needs it (in the future there will be ways to make this happen less often than every bundle).
You could serialize the contents of the map into the arguments of the DoFn, by putting it in as a non-static field passed to the constructor of that class.
Option 1 is better as the size of the file increases (since it can support splitting it into pieces and doing lookups), while Option 2 likely involves less network traffic to retrieve the file. Option 3 will only work if the file is extremely small, since it will significantly increase the size of the serialized DoFn, which may lead to the job being too large to submit to the Dataflow service.
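As a rough illustration of option 3 (only appropriate for small maps, as noted above): the map becomes a non-static field that is serialized with the DoFn and shipped to every worker. This sketch uses the newer Beam-style DoFn annotations; with the older Dataflow SDK the processElement override looks slightly different, but the idea is the same. SensorValue is your own class, and the constructor shown for it is assumed.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

import org.apache.beam.sdk.transforms.DoFn;

public class ParseSensorValueFn extends DoFn<String, SensorValue> {

    // non-static field: serialized together with the DoFn and sent to the workers
    private final Map<String, ArrayList<Double>> sensorToCoordinates;

    public ParseSensorValueFn(Map<String, ArrayList<Double>> sensorToCoordinates) {
        // copy into a plain serializable map implementation
        this.sensorToCoordinates = new HashMap<>(sensorToCoordinates);
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
        // build the SensorValue from the message plus the shipped-in map,
        // instead of reading it from a static field that only exists locally
        c.output(new SensorValue(c.element(), sensorToCoordinates));
    }
}

It would then be applied as ParDo.of(new ParseSensorValueFn(sensorToCoordinates)) when the pipeline is built.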
I have a situation where an external system will send me 4 different files at the same time. Let's call them the following:
customers.xml (optional)
addresses.xml (optional)
references.xml (optional)
activity.xml (trigger file)
When the trigger file is sent and picked up by Camel, Camel should look to see if file #1 exists; if it does, process it; if it doesn't, move on to files #2 and #3, applying the same if/then logic. Once that logic has been performed, it can proceed with file #4.
I found elements like onCompletion and checking whether the body is null or not, but if someone has a much better idea, I would greatly appreciate it.
As I thought about this further, it turned out to be more of a sequencing problem. The key here is that I would be receiving the files in batches at the same time. That being said, I created a pluggable CustomComparator.
Once I created my CustomComparator class to order my files by a given ArrayList index position, I was able to route the messages in the order I wanted.
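As a rough illustration (not the original code; the endpoint URI and bean name are placeholders), the file and FTP components accept such a pluggable Comparator through their sorter option, so the comparator can force the optional files ahead of the trigger file within a batch:

import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

import org.apache.camel.component.file.GenericFile;

public class BatchOrderComparator implements Comparator<GenericFile<?>> {

    // desired processing order; the trigger file comes last
    private static final List<String> ORDER = Arrays.asList(
            "customers.xml", "addresses.xml", "references.xml", "activity.xml");

    @Override
    public int compare(GenericFile<?> f1, GenericFile<?> f2) {
        return Integer.compare(indexOf(f1), indexOf(f2));
    }

    private int indexOf(GenericFile<?> file) {
        int idx = ORDER.indexOf(file.getFileName());
        // unknown files sort after the known ones
        return idx < 0 ? ORDER.size() : idx;
    }
}

Registered as a bean (for example under the name batchOrderComparator), it would be referenced from the endpoint as file:inbox?sorter=#batchOrderComparator.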
We have to monitor changes to files on a remote system that we access through FTP or SMB.
We do not have any SSH access to the remote system / OS. Our only view of the remote system is what FTP or Samba lets us see.
What we do today :
Periodically scan the whole directory, construct a representation in memory to do our work, and then merge it with what we have in the database.
What we would like to do :
Be able to determine whether the directory has changed, and thus whether parsing is needed. Ideally, never have to do a full parse. We don't want to rely too much on OS capabilities (inodes...) because they could change from one installation to another.
Main goal: this process starts to get slow when the amount of data is very large. Only a few percent of this data is new and needs to be parsed. How can we parse and add only this new part to our database?
The leads we are discussing at the moment:
Checking the size of the folder
Using checksums on files
Checking the last modification date of folders / files
What we really want:
Some input and best practices, because this problem seems pretty common and has probably been discussed already, and we don't want to end up doing something overly complicated here.
Thanks in advance, from a bunch of fellow developers ;-)
We use a Java/Spring/Hibernate stack, but I don't think that matters much here.
Edit: basically, we access an FTP server or equivalent. A local copy is not an option, since the amount of data is way too large.
The Remote Directory Poller for Java (rdp4j) library can help you out with polling your FTP location and will notify you of the following events: file added/removed/modified in a directory. It uses the lastModified date of each file in the directory and compares it with the previous poll.
See the complete User Guide, which contains implementations of the FtpDirectory and MyListener classes used in the quick tutorial of the API below:
package example;

import java.util.concurrent.TimeUnit;

import com.github.drapostolos.rdp4j.DirectoryPoller;
import com.github.drapostolos.rdp4j.spi.PolledDirectory;

public class FtpExample {

    public static void main(String[] args) throws Exception {
        String host = "ftp.mozilla.org";
        String workingDirectory = "pub/addons";
        String username = "anonymous";
        String password = "anonymous";

        PolledDirectory polledDirectory = new FtpDirectory(host, workingDirectory, username, password);

        DirectoryPoller dp = DirectoryPoller.newBuilder()
                .addPolledDirectory(polledDirectory)
                .addListener(new MyListener())
                .setPollingInterval(10, TimeUnit.MINUTES)
                .start();

        TimeUnit.HOURS.sleep(2);
        dp.stop();
    }
}
You cannot use directory sizes or modification dates to tell if subdirectories have changed. Full stop. At a minimum you have to do a full directory listing of the whole tree.
You may be able to avoid reading file contents if you are satisfied you can rely on the combination of the modification date and time.
My suggestion is use off-the-shelf software to create a local clone (e.g. rsync, robocopy) then do the comparison/parse on the local clone. The question "is it updated" is then a question for rsync to answer.
As previously mentioned, there is no way you can track directories via FTP or SMB. What you can do is list all files on the remote server and construct a snapshot that contains:
for each file: name, size, and modification date;
for each directory: name and the latest modification date among its contents.
Using this information you will be able to determine which directories need to be looked into and which files need to be transferred.
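A minimal sketch of building such a snapshot for a single directory, assuming Apache Commons Net (the host, credentials and path are placeholders, and subdirectories would still need to be walked recursively and rolled up as described above):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.net.ftp.FTPClient;
import org.apache.commons.net.ftp.FTPFile;

public class FtpSnapshot {

    // builds a "path -> size|lastModified" fingerprint map of one remote directory;
    // diffing two successive snapshots tells you which entries changed
    public static Map<String, String> snapshot(FTPClient ftp, String dir) throws IOException {
        Map<String, String> snapshot = new HashMap<>();
        for (FTPFile file : ftp.listFiles(dir)) {
            String fingerprint = file.getSize() + "|" + file.getTimestamp().getTimeInMillis();
            snapshot.put(dir + "/" + file.getName(), fingerprint);
        }
        return snapshot;
    }

    public static void main(String[] args) throws IOException {
        FTPClient ftp = new FTPClient();
        ftp.connect("ftp.example.com");
        ftp.login("user", "secret");
        try {
            Map<String, String> current = snapshot(ftp, "/data");
            // persist "current" and compare it against the previous run to decide
            // which files actually need to be transferred and parsed
            System.out.println(current);
        } finally {
            ftp.logout();
            ftp.disconnect();
        }
    }
}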
The safe and portable solution is to use a strong hash/checksum such as SHA1 or (preferably) SHA512. The hash can be mapped to whatever representation you want to compute and store. You can use the following recursive recipe (adapted from the Git version control system); a sketch in code follows the list:
The hash of a file is the hash of its contents, disregarding the name;
to hash a directory, consider it as a sorted list of filename-hash pairs in a textual representation and hash that.
Maybe prepend f to every file and d to every directory representation before hashing.
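A sketch of that recipe over a locally visible tree, assuming SHA-512 via java.security.MessageDigest (error handling kept minimal):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class TreeHash {

    // a file's hash is the hash of its contents (name ignored);
    // a directory's hash is the hash of its sorted "type name hash" lines
    public static String hash(Path path) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-512");
        if (Files.isDirectory(path)) {
            List<String> lines = new ArrayList<>();
            try (DirectoryStream<Path> children = Files.newDirectoryStream(path)) {
                for (Path child : children) {
                    String type = Files.isDirectory(child) ? "d" : "f";
                    lines.add(type + " " + child.getFileName() + " " + hash(child));
                }
            }
            Collections.sort(lines); // the order must be deterministic
            md.update(String.join("\n", lines).getBytes(StandardCharsets.UTF_8));
        } else {
            md.update(Files.readAllBytes(path));
        }
        return toHex(md.digest());
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(hash(Paths.get(args[0])));
    }
}

Storing the hash of every directory from the previous run lets you skip entire subtrees whose hashes are unchanged and only parse the parts that differ.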
You could also put the directory under version control using Git (or Mercurial, or whatever you like), periodically git add everything in it, use git status to find out what was updated, and git commit the changes.