I have a remote server which generates files and pushes them to our Hadoop cluster every 15 minutes. The files are stored in a specific directory. We use Flume to read the files from that local directory and send them to HDFS, and the spooling directory source (SpoolDir) seemed to be the suitable way to process the data.
The problem is that Flume stops processing when a file is still being written into the directory.
I don't know how to make the Flume spooldir source wait until a file is completely written before processing it, or how to block reading of the file until it is completely written, using a shell script or a processor.
Can someone help me?
Set the pollDelay property for the spooling directory source.
The spool dir source polls the given directory for new files at a specific interval.
The default value is 500 ms, which is too fast for many systems, so you should configure it accordingly.
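For illustration, a minimal spooling-directory source configuration with a slower poll might look like this (the agent, source, channel, sink and path names are made up):

# agent1, spool1, ch1, hdfs1 and /data/incoming are illustrative names
agent1.sources = spool1
agent1.channels = ch1
agent1.sinks = hdfs1

agent1.sources.spool1.type = spooldir
agent1.sources.spool1.spoolDir = /data/incoming
agent1.sources.spool1.channels = ch1
# poll every 30 s instead of the default 500 ms, so a file that is still being
# copied in has more time to finish before the source scans the directory again
agent1.sources.spool1.pollDelay = 30000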
I am trying to read a CSV file kept on my local filesystem in UNIX, but when running the job in cluster mode it is not able to find the CSV file.
In local mode it can read both HDFS and file:/// paths. However, in cluster mode it can only read HDFS paths.
Is there any suitable way to read the file without copying it into HDFS?
Remember that the executors need to be able to access the file, so you have to look at this from the executor nodes' point of view. Since you mention HDFS, that means the executor nodes must have access to your HDFS cluster.
If you want the Spark cluster to access a local file, consider NFS/SMB etc. However, something will end up copying the data.
I can update my answer if you add more details on your architecture.
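For illustration, here is a minimal sketch using Spark's Java API, assuming the CSV sits on a mount (NFS/SMB) that is visible at the same path on every executor node; the path and app name are made up:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadSharedCsv {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("read-shared-csv")
                .getOrCreate();

        // In cluster mode this file:/// path is resolved on the executors,
        // so /mnt/shared must be mounted on every worker node.
        Dataset<Row> df = spark.read()
                .option("header", "true")
                .csv("file:///mnt/shared/input.csv");

        df.show();
        spark.stop();
    }
}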
We have files of up to 8 GB that contain structured content, but important metadata is stored on the last line of each file and needs to be appended to every line of content. It is easy to use a ReverseFileReader to grab this last line, but that requires the file to be static on disk, and I cannot find a way to do this within our existing NiFi flow. Is this possible before the data is streamed to the content repository?
Processing an 8 GB file inside NiFi might be inefficient. You may try another option:
ListSFTP --> ExecuteSparkInteractive --> RouteOnAttributes --> ....
Here, you don't need to actually flow the data through NiFi. Just pass the file location (it could be an HDFS or non-HDFS location) in a NiFi attribute and write either PySpark or Spark Scala code to read that file (you can run this code through ExecuteSparkInteractive). The code will be executed on the Spark cluster and only the job result will be sent back to NiFi, which you can then use to route your NiFi flow (using the RouteOnAttribute processor).
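As a rough sketch of the Spark side (shown here with the Java API, although PySpark or Scala would work the same way; the paths are made up and would normally come from the NiFi attribute):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;

public class AppendTrailerMetadata {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("append-trailer-metadata")
                .getOrCreate();

        // In the NiFi flow this location would arrive as a flow-file attribute from ListSFTP
        String input = "hdfs:///data/incoming/bigfile.txt";

        JavaRDD<String> lines = spark.read().textFile(input).javaRDD();
        long count = lines.count();

        // Grab the metadata stored on the last line of the file
        String trailer = lines.zipWithIndex()
                .filter(t -> t._2() == count - 1)
                .map(t -> t._1())
                .first();

        // Append the trailer to every content line and write the result back out
        lines.zipWithIndex()
                .filter(t -> t._2() < count - 1)
                .map(t -> t._1() + "|" + trailer)
                .saveAsTextFile("hdfs:///data/processed/bigfile");

        spark.stop();
    }
}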
Note: you need a Livy setup to run Spark code from NiFi.
Hope this is helpful.
There is a process which dumps 10k files onto a shared NFS drive. I need to read and process the data from those files. I have written Java code which works great in a single-node environment, but when the code is deployed to a WAS cluster with 4 nodes, the nodes pick up and process the same files.
How can I avoid this? Is there some sort of file-locking feature that I can use to fix this issue? Any help is highly appreciated.
More info:
I am using the org.apache.commons.io.monitor library to poll the NFS directory every 10 seconds. We then read and process the files and move each file to a post-process folder. As mentioned, this works great in a single-node environment. When deployed in the cluster, the nodes poll the same file and process it, which causes multiple calls with the same data to a backend service.
I am looking for an optimal solution.
PS: The application which processes the files doesn't have access to any kind of database.
Thanks in advance
"Is there some sort of file lock feature that I can use to fix this issue?" Not without doing some work on your end. You could create another file with the same name ending in .lock and have the application check to see if a lock file exists by creating the lock file and if it succeeds then it will process the file. If it fails it then knows one of the other cluster members already grabbed the lock file.
I am writing a simple command-line application which copies files from an FTP server to a local drive. Let's assume that I am using the following route definition:
File tmpFile = File.createTempFile("repo", "dat");
IdempotentRepository<String> repository = FileIdempotentRepository.fileIdempotentRepository(tmpFile);
from("{{ftp.server}}")
.idempotentConsumer(header("CamelFileName"), repository)
.to("file:target/download")
.log("Downloaded file ${file:name} complete.");
where ftp.server is something like:
ftp://ftp-server.com:21/mypath?username=foo&password=bar&delay=5
Let's assume that the files on the FTP server will not change over time. How do I check whether the copying has finished or there are still more files to copy? I need this because I want to exit my app once all the files are copied.
Read about the batch consumer:
http://camel.apache.org/batch-consumer.html
The FTP consumer will set some exchange properties with the number of files in the poll, whether the current file is the last one, etc.
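For example, building on the route from the question, you could check those properties on each exchange; a rough sketch (how you then stop the application, e.g. with a CountDownLatch, is up to you):

import java.io.File;

import org.apache.camel.Exchange;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.processor.idempotent.FileIdempotentRepository;
import org.apache.camel.spi.IdempotentRepository;

public class FtpDownloadRoute extends RouteBuilder {
    @Override
    public void configure() throws Exception {
        File tmpFile = File.createTempFile("repo", "dat");
        IdempotentRepository<String> repository =
                FileIdempotentRepository.fileIdempotentRepository(tmpFile);

        from("{{ftp.server}}")
            .idempotentConsumer(header("CamelFileName"), repository)
            .to("file:target/download")
            .log("Downloaded file ${file:name} complete.")
            .process(exchange -> {
                // properties set by the batch consumer on every exchange of a poll
                int size = exchange.getProperty(Exchange.BATCH_SIZE, Integer.class);          // CamelBatchSize
                int index = exchange.getProperty(Exchange.BATCH_INDEX, Integer.class);        // CamelBatchIndex (0-based)
                boolean last = exchange.getProperty(Exchange.BATCH_COMPLETE, Boolean.class);  // CamelBatchComplete
                System.out.println("Copied file " + (index + 1) + " of " + size);
                if (last) {
                    // last file of this poll -> signal the main thread
                    // (e.g. count down a CountDownLatch) so the app can exit
                    System.out.println("All " + size + " files of this poll copied.");
                }
            });
    }
}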
Do you have any control over the end that publishes the FTP files? E.g. is it your server and your client, or can you make a request as a customer?
If so, you could ask for a flag file to be added at the end of their batch process. This is a single-byte file with an agreed name that you watch for; when that file appears, you know the batch is complete.
This is a useful technique if you regularly pull down huge files that take a long time for a batch process to copy to disk at the server end, e.g. a file produced by some streaming process.
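If the consuming side is Camel (as in this question), the doneFileName option on the file/FTP consumer is one way to wire up exactly this pattern; a minimal sketch of the endpoint URI, assuming the publisher agrees to write a flag file named batch.done (the server, credentials and flag name are illustrative):

ftp://ftp-server.com:21/mypath?username=foo&password=bar&doneFileName=batch.done

With a fixed doneFileName, the consumer only starts picking files up from the directory once that flag file is present.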
I am very new to Apache Camel, and I am exploring how to create a route which pulls data from FTP, for instance every 15 minutes, and pulls only new or updated files. So if some files were downloaded earlier and are still the same (unchanged), the FTP loader should not load them to the destination folder again.
Any advice is warmly appreciated.
UPDATE #1
I've already noticed that I need to look at FTP2, and I've actually made some progress. The last thing I want to clarify is what consumer.delay defines. Does it define the delay between download attempts? For instance, with consumer.delay = 5s: on the first attempt the FTP server contains 5 files and the consumer pulls them somewhere, then waits 5 s; on the second attempt the FTP content is unchanged, so Camel just does nothing; then 5 more files arrive on the server, and after 5 seconds the consumer downloads just those newly arrived files. Or does consumer.delay just make the consumer wait between each individual file download (file #1 -> 5s -> file #2 -> 5s -> etc.)?
I want to achieve first scenario.
Also, I observed that once some files have been downloaded to the destination folder (from FTP to the local file system), those files are ignored in subsequent data loads, even if they were deleted from the local file system. How can I tell Camel to download deleted files again, and where does it store the information about already-loaded files? It also seems that it downloads all files on each pull, even files that were already downloaded in the first pull. Do I need to write a filter to exclude already-downloaded files?
There is an FTP component for Apache Camel: http://camel.apache.org/ftp.html
Use the consumer.delay property to set the delay in milliseconds between each poll.
For implementation details, look here: http://architects.dzone.com/articles/apache-camel-integration
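For instance, building on the URI from the earlier question (server, credentials and path are the illustrative ones used there), a 15-minute poll would look like:

ftp://ftp-server.com:21/mypath?username=foo&password=bar&consumer.delay=900000

The delay applies to the whole poll, not to individual files: on each poll the consumer downloads whatever new files it finds, then waits 900000 ms (15 minutes) before polling again, which is the first scenario described in the update.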