There is a process that dumps about 10k files onto a shared NFS drive. I need to read and process the data from these files. I have written Java code that works great in a single-node environment, but when the code is deployed in a WAS cluster with 4 nodes, the nodes pick up and process the same files.
How can I avoid this? Is there some sort of file lock feature that I can use to fix this issue? Any help is highly appreciated.
More info:
I am using the org.apache.commons.io.monitor library to poll the NFS directory every 10 seconds. We then read and process the files and move each file to a post-process folder. As mentioned, this works great in a single-node environment. When deployed in the cluster, the nodes poll the same files and process them, which causes multiple calls with the same data to a backend service.
I am looking for an optimal solution.
PS: The application which processes the files doesn't have access to any kind of database.
Thanks in advance
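For context, the polling setup described above might look roughly like this sketch (class names, paths, and the listener body are illustrative, not the asker's actual code):

import java.io.File;
import org.apache.commons.io.monitor.FileAlterationListenerAdaptor;
import org.apache.commons.io.monitor.FileAlterationMonitor;
import org.apache.commons.io.monitor.FileAlterationObserver;

public class NfsPoller {
    public static void main(String[] args) throws Exception {
        // watch the shared NFS directory (path is illustrative)
        FileAlterationObserver observer = new FileAlterationObserver(new File("/mnt/nfs/incoming"));
        observer.addListener(new FileAlterationListenerAdaptor() {
            @Override
            public void onFileCreate(File file) {
                // read and process the file, then move it to the post-process folder
                processAndMove(file);
            }
        });
        // poll every 10 seconds
        FileAlterationMonitor monitor = new FileAlterationMonitor(10_000, observer);
        monitor.start();
    }

    private static void processAndMove(File file) {
        // processing logic and the move to the post-process folder would go here
    }
}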
"Is there some sort of file lock feature that I can use to fix this issue?" Not without doing some work on your end. You could create another file with the same name ending in .lock and have the application check to see if a lock file exists by creating the lock file and if it succeeds then it will process the file. If it fails it then knows one of the other cluster members already grabbed the lock file.
Related
I am trying to read a CSV file kept on my local filesystem in UNIX, but when running in cluster mode the job is not able to find the CSV file.
In local mode, it can read both HDFS and file:/// files. However, in cluster mode, it can only read HDFS files.
Is there any suitable way to read the file without copying it into HDFS?
Remember that the executors need to be able to access the file, so you have to look at it from the executor nodes' point of view. Since you mention HDFS, it means that the executor nodes must have access to your HDFS cluster.
If you want the Spark cluster to access a local file, consider NFS/SMB etc. However, something will end up copying the data.
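For example, if the file lives on an NFS share mounted at the same path on every executor node (the mount path below is an assumption), it can be read with a file:/// URI; a minimal Java sketch:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadSharedCsv {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("ReadSharedCsv").getOrCreate();
        // the path must exist on the driver and on every executor node,
        // e.g. an NFS share mounted at the same location everywhere
        Dataset<Row> df = spark.read()
                .option("header", "true")
                .csv("file:///mnt/shared/data/input.csv");
        df.show();
        spark.stop();
    }
}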
I can update my answer if you add more details on your architecture.
My requirement is:
To push files to Windows servers from another Windows server. There will be nearly 1000 files in one batch, and these batches will be pushed to SFTP locations.
So I need to configure multiple threads to process these files.
Issues:
First, I pick up the list of files from the source directory and iterate over it. That part works fine: each thread picks up a different file and nothing collides.
Then I need to connect with SFTP...
Here's the question.
I need to create a separate session for each thread, right? And also separate channels too?
Does JSch's SFTP connection work for Windows servers too?
And third, when I am multithreading and call
ChannelSftp.put(src, des), it fails to upload: sometimes "pipe closed", sometimes no file, sometimes an "index out of bounds" exception, sometimes "input stream is closed". How can I configure this, if possible? A connection is created for each thread, but when it comes to putting the file into the SFTP location it doesn't work.
If you have any suggestions for pushing files with multithreading, please let me know; it would help.
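For what it's worth, a minimal sketch of the per-thread approach described above, assuming password authentication (host, credentials, and paths are placeholders): each worker thread owns its own Session and ChannelSftp and never shares them with other threads.

import com.jcraft.jsch.ChannelSftp;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;

public class SftpUploadTask implements Runnable {
    private final String localPath;
    private final String remotePath;

    public SftpUploadTask(String localPath, String remotePath) {
        this.localPath = localPath;
        this.remotePath = remotePath;
    }

    @Override
    public void run() {
        Session session = null;
        ChannelSftp sftp = null;
        try {
            // each thread builds its own session and channel;
            // JSch sessions and channels are not safe to share across threads
            JSch jsch = new JSch();
            session = jsch.getSession("user", "sftp-host.example.com", 22);
            session.setPassword("secret");
            session.setConfig("StrictHostKeyChecking", "no");
            session.connect();
            sftp = (ChannelSftp) session.openChannel("sftp");
            sftp.connect();
            sftp.put(localPath, remotePath);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (sftp != null) sftp.disconnect();
            if (session != null) session.disconnect();
        }
    }
}

The tasks can then be submitted to an ExecutorService with a bounded pool so that only a limited number of SFTP connections are open at the same time.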
I have developed a Document Management System (DMS) with an OCR feature. However, the OCR step takes too much time and causes high CPU usage.
My current process is synchronous, as below:
User uploads his file
OCR process
Store document information in DB
Considering the real-time production load, I want to make the second step above asynchronous, running on a dedicated, separate file-processing server.
My questions are:
Is it the right way to do it?
How do I send/retrieve the file to another server for processing? I also looked into using a message queue, but I cannot put the whole file into it.
Is there any way we can acknowledge process completion?
Just to close this question: I have successfully moved the OCR process to a separate file-processing server, using a FIFO approach, which really helped resolve the high CPU usage.
I followed the steps below:
User uploads the file
OCR status is set to pending
The separate server processes pending files one at a time, in FIFO order.
The OCR status is updated in the database.
More processing servers can be added later, as needed based on the load.
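A minimal sketch of what the processing server's FIFO loop could look like, assuming a hypothetical documents table with id, file_path, and ocr_status columns and a PostgreSQL connection (the schema, query, and connection details are illustrative, not the actual implementation):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class OcrWorker {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:postgresql://db-host/dms", "user", "pw")) {
            while (true) {
                // pick the oldest pending document (FIFO), one at a time
                try (PreparedStatement ps = con.prepareStatement(
                        "SELECT id, file_path FROM documents WHERE ocr_status = 'PENDING' ORDER BY id LIMIT 1");
                     ResultSet rs = ps.executeQuery()) {
                    if (!rs.next()) {
                        Thread.sleep(5_000); // nothing pending, wait and poll again
                        continue;
                    }
                    long id = rs.getLong("id");
                    String path = rs.getString("file_path");
                    runOcr(path); // the actual OCR call
                    try (PreparedStatement upd = con.prepareStatement(
                            "UPDATE documents SET ocr_status = 'DONE' WHERE id = ?")) {
                        upd.setLong(1, id);
                        upd.executeUpdate();
                    }
                }
            }
        }
    }

    private static void runOcr(String filePath) {
        // OCR processing for the given file would go here
    }
}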
I have a remote server which generates files. The server pushes files every 15 minutes to the Hadoop cluster, where they are stored in a specific directory. We use Flume to read files from this local directory and send them to HDFS, and the SpoolDir source seemed the most suitable way to process the data.
The problem is that Flume shuts down processing when a file is still being written into the directory.
I don't know how to make the Flume spooldir source wait until a file is completely written before processing it.
Or how to block reading of the file until it is completely written, using a shell script or a processor.
Can someone help me?
Set the pollDelay property for the spool source.
The spooling directory source polls the given directory for new files at a specific interval.
The default value is 500 ms,
which is too fast for many systems, so you should configure it accordingly.
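A minimal agent configuration sketch, assuming an agent named a1 and a source named src (names and paths are illustrative); the optional ignorePattern line additionally skips files that are still being written, assuming the writer gives them a .tmp suffix until the write is complete:

# poll the spool directory every 5 seconds instead of every 500 ms
a1.sources.src.type = spooldir
a1.sources.src.spoolDir = /data/incoming
a1.sources.src.pollDelay = 5000
# skip in-progress files (assumes the writer renames them when finished)
a1.sources.src.ignorePattern = ^.*\.tmp$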
I am writing a simple command-line application which copies files from an FTP server to a local drive. Let's assume that I am using the following route definition:
// file-backed idempotent repository, so already-downloaded file names are remembered
File tmpFile = File.createTempFile("repo", "dat");
IdempotentRepository<String> repository = FileIdempotentRepository.fileIdempotentRepository(tmpFile);
// skip files whose name has been seen before, save the rest locally
from("{{ftp.server}}")
    .idempotentConsumer(header("CamelFileName"), repository)
    .to("file:target/download")
    .log("Downloaded file ${file:name} complete.");
where ftp.server is something like:
ftp://ftp-server.com:21/mypath?username=foo&password=bar&delay=5
Let's assume that the files on the FTP server will not change over time. How do I check whether the copying has finished or whether there are still more files to copy? I need this because I want to shut my app down once all files are copied.
Read about the batch consumer:
http://camel.apache.org/batch-consumer.html
The FTP consumer sets some exchange properties with the total number of files, whether the current one is the last file, etc.
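For example, a sketch that extends the route above so the application can stop once the last exchange of a polled batch has been handled (the latch and the stop logic are illustrative); Camel exposes the batch information through exchange properties such as Exchange.BATCH_COMPLETE and Exchange.BATCH_SIZE:

final CountDownLatch done = new CountDownLatch(1); // java.util.concurrent

from("{{ftp.server}}")
    .idempotentConsumer(header("CamelFileName"), repository)
    .to("file:target/download")
    .log("Downloaded file ${file:name} complete.")
    .process(exchange -> {
        // CamelBatchComplete is true on the last exchange of the current poll
        Boolean last = exchange.getProperty(Exchange.BATCH_COMPLETE, Boolean.class);
        if (Boolean.TRUE.equals(last)) {
            done.countDown(); // the main thread awaits this latch, then stops the CamelContext
        }
    });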
Do you have any control over the end that publishes the FTP files? E.g., is it your server and your client, or can you make a request as a customer?
If so, you could ask for a flag file to be added at the end of their batch process. This is a single-byte file with an agreed name that you watch for; when that file appears, you know the batch is complete.
This is a useful technique if you regularly pull down huge files and they take a long time for a batch process to copy to disk at the server end, e.g. when a file is produced by some streaming process.
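A minimal sketch of watching for such a flag file on the consuming side, assuming the agreed name is batch.done and it appears in the download directory (both names are assumptions):

import java.io.File;

public class FlagFileWatcher {

    // Blocks until the agreed flag file appears, polling once per second.
    public static void waitForFlag(File directory, String flagName) throws InterruptedException {
        File flag = new File(directory, flagName);
        while (!flag.exists()) {
            Thread.sleep(1_000);
        }
    }
}

Calling waitForFlag(new File("target/download"), "batch.done") would return once the publisher drops the flag file, after which the application knows the batch is complete and can shut down.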