I am trying to retrieve files from only certain subdirectories on an FTP server. For example, I want to poll for files under only subdirectories A and C.
ROOT_DIR/A/test1.xml
ROOT_DIR/B/test2.xml
ROOT_DIR/C/test3.xml
ROOT_DIR/..(there are hundreds of subdirs)
I'm trying to avoid having an endpoint for each directory since many more directories to consume from may be added in the future.
I have successfully implemented this using a single SFTP endpoint on ROOT_DIR with recursive=true in conjunction with an AntPathMatcherGenericFileFilter instance as suggested.
The issue I'm having is that every subdirectory is being searched (hundreds of them), and my filter is also looking for certain filenames only. The result is filtered only after every directory is searched, and this is taking far too long (minutes).
Is there any way to only consume from certain subdirectories that could be maintained in a properties file without searching every subdirectory?
I did find one possible solution that takes a different approach: a timer-based polling consumer with the Ant filter. With this solution, a dynamic SFTP endpoint (or a list of dynamic SFTP endpoints) can be consumed from using a ConsumerTemplate inside a bean instead of doing so within a route.
while (true) {
    // receive the message from the queue, wait at most 3 sec
    String msg = consumer.receiveBody("activemq:queue.inbox", 3000, String.class);
    // ...
}
This can then be used with my existing Ant filter to select only certain files under a dynamic list of subdirectories to consume from.
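For illustration, a rough sketch of what such a bean might look like. The host, credentials and the property holding the directory list are placeholders, and the existing AntPathMatcherGenericFileFilter is assumed to be registered in the Camel registry under the name antFilter:
import org.apache.camel.CamelContext;
import org.apache.camel.ConsumerTemplate;

public class SubdirPoller {

    private final CamelContext camelContext;
    private final String[] subdirs;

    // subdirs would be read from a properties file, e.g. poll.subdirs=A,C (placeholder name)
    public SubdirPoller(CamelContext camelContext, String[] subdirs) {
        this.camelContext = camelContext;
        this.subdirs = subdirs;
    }

    public void pollOnce() throws Exception {
        ConsumerTemplate consumer = camelContext.createConsumerTemplate();
        try {
            for (String dir : subdirs) {
                // placeholder host/credentials; "antFilter" is the existing Ant-style filter bean
                String uri = "sftp://user@host/ROOT_DIR/" + dir
                        + "?password=secret&filter=#antFilter&disconnect=true";
                // wait at most 3 seconds for a matching file in this subdirectory
                String body = consumer.receiveBody(uri, 3000, String.class);
                if (body != null) {
                    // hand the file content to the existing processing logic here
                }
            }
        } finally {
            consumer.stop();
        }
    }
}
This bean could then be invoked from a timer route or a plain scheduler, so the list of subdirectories stays easy to maintain externally.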
As a best practice, I am trying to index a bunch of documents to Solr in one request instead of indexing one at a time. Now I have the problem that the files I am indexing are of different types (pdf, word document, text file, ...) and therefore have different metadata that gets extracted from Tika and indexed.
I'd like to have certain fields/information for all files, regardless of the type, such as creator, creation date and path for example, but I don't know how to manually add fields when I index all the files at once.
If I indexed one file at a time, I could just add fields with request.setParam(), but that applies to the whole request and not to one file. And even if something like this were possible, how would I get information like the creator of a file in Java?
Is there a possibility to add fields for each file?
if (listOfFiles != null) {
    for (File file : listOfFiles) {
        if (file.isFile()) {
            request.addFile(file, getContentType(file));
            // add field only for this file?
        } else {
            // folder: call the same method again -> recursion
            request = addFilesToRequest(file, request);
        }
    }
}
As far as I know there is no way of submitting multiple files in the same request. These requests are usually so heavy on processing anyway that lowering the number of HTTP requests may not change the total processing time much.
If you want to speed it up, you can process all your files locally with Tika first (Tika is what's being used internally in Solr as well), then only submit the extracted data. That way you can multithread the extracting process, add the results to a queue and let the Solr submission process be performed as the queue grows - with all the content submitted to Solr in several larger batches (for example 1000 documents at a time).
This also allows you to scale your indexing process without having to add more Solr servers to make that part of the process go faster (if your Solr node can keep up with search traffic, it shouldn't be necessary to scale it just to process documents).
Using Tika manually also makes it easier to correct or change details while processing, such as file formats that return dates in timezones other than what you expect.
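A rough sketch of this approach with Tika and SolrJ; the Solr URL, core name, input directory, field names and batch size are all placeholders:
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class LocalTikaIndexer {

    private static final int BATCH_SIZE = 1000;

    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
        AutoDetectParser parser = new AutoDetectParser();
        List<SolrInputDocument> batch = new ArrayList<>();

        File[] files = new File("/data/inbox").listFiles();
        if (files == null) {
            return;
        }
        for (File file : files) {
            if (!file.isFile()) {
                continue;
            }
            // extract content and metadata locally instead of letting Solr run Tika
            BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
            Metadata metadata = new Metadata();
            try (InputStream in = new FileInputStream(file)) {
                parser.parse(in, handler, metadata, new ParseContext());
            }

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", file.getAbsolutePath());
            doc.addField("path", file.getAbsolutePath());
            doc.addField("content", handler.toString());
            // per-file fields can be added freely here, e.g. a creator taken from Tika metadata
            doc.addField("creator", metadata.get("dc:creator"));
            batch.add(doc);

            if (batch.size() >= BATCH_SIZE) {
                solr.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            solr.add(batch);
        }
        solr.commit();
        solr.close();
    }
}
The extraction loop is what you would multithread; the Solr submission can stay single-threaded and simply drain the queue in larger batches.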
In my project, I am trying to add some metadata to the data processed in my pipeline. The metadata is located in a DBF file in a subfolder called resources, next to the src folder.
The src folder contains the main class, and I have several packages (IO, processing, aggregation, utils).
I read and process the file with metadata in my main class where the pipeline is defined. The code I am using to access the file is as follows:
File temp1 = new File("resources/xxx.dbf");
I check if the file was found using:
LOG.info(temp1.exists())
which runs fine.
There are messages coming in as Strings which I read using PubSubIO. I use the contents of this file to fill a Map containing keys and values.
Map<String, ArrayList<Double>> sensorToCoordinates = coordinateData.getSensorLocations();
I then set a static variable in a custom class I made called SensorValue:
SensorValue.setKeyToCoordinates(sensorToCoordinates);
When parsing the incoming messages from Strings to the SensorValue class using a ParDo function (going from a PCollection<String> to a PCollection<SensorValue>), the map is used in the constructor of the SensorValue class.
Running this code using a DirectPipelineRunner works perfectly. However, when I use a DataflowPipelineRunner and try to access the map in the SensorValue constructor, I run into a NullPointerException.
Now I am wondering why the setter is not working when using a DataflowPipelineRunner (I'm guessing it has something to do with the execution being distributed among several workers), and what the best practice is for using static resource files to enrich a pipeline.
You're right that the problem is that the execution of the ParDo is distributed across multiple workers. They don't have the local file, and they may not have the contents of the map.
There are a few options here:
1. Put the file in GCS, and have the pipeline read the contents of the file (using TextIO or something like that) and use it as a side input to your later processing.
2. Include the file in the resources for the pipeline and load it in the startBundle of the DoFn that needs it (in the future there will be ways to make this happen less often than every bundle).
3. Serialize the contents of the map into the arguments of the DoFn, by putting it in as a non-static field passed to the constructor of that class.
Option 1 is better as the size of this file increases (since it can support splitting it up into pieces and doing lookups), while Option 2 likely involves less network traffic to retrieve the file. Option 3 will only work if the file is extremely small, since it will significantly increase the size of the serialized DoFn, which may lead to the job being too large to submit to the Dataflow service.
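A rough sketch of Option 1, written against the Apache Beam SDK (the successor of the Dataflow SDK). Here p is the pipeline, messages is the PCollection of strings read from Pub/Sub as in the question, and the bucket path, field layout and SensorValue.parse helper are hypothetical:
// read the metadata file from GCS and turn it into a map side input
PCollection<KV<String, String>> sensorLocations = p
    .apply("ReadLocations", TextIO.read().from("gs://my-bucket/sensor_locations.csv"))
    .apply("ParseLine", MapElements.via(new SimpleFunction<String, KV<String, String>>() {
        @Override
        public KV<String, String> apply(String line) {
            String[] parts = line.split(",", 2);
            return KV.of(parts[0], parts[1]); // sensor id -> "lat,lon"
        }
    }));

final PCollectionView<Map<String, String>> locationsView =
    sensorLocations.apply(View.asMap());

// every worker looks the map up via the side input instead of a static field
PCollection<SensorValue> values = messages.apply("ParseMessages",
    ParDo.of(new DoFn<String, SensorValue>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            Map<String, String> locations = c.sideInput(locationsView);
            c.output(SensorValue.parse(c.element(), locations)); // hypothetical helper
        }
    }).withSideInputs(locationsView));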
I want to understand how flume-ng will handle such a situation in terms of file name collisions.
Assume I have several instances of equally configured Flume agents, and a client uses them as a load-balancing group.
a1.sinks.k1.hdfs.path = /flume/events/path
How will Flume agents generate file names to make them unique across agents? Do they append the agent name to them somehow (the names look like numbers, so it is hard to figure this out)?
Flume does not solve this problem automatically. By default the HDFS sink creates a new file with a name equal to the current timestamp (in milliseconds), so a collision may occur if two files are created at the same moment.
One way to fix this is to manually set different file prefixes in the different sinks:
a1.sinks.k1.hdfs.filePrefix = agentX
You can also use event headers in the prefix definition. For example, if you use the host interceptor, which adds a "host" header with the agent's hostname to each event, you can do something like this:
a1.sinks.k1.hdfs.filePrefix = ${host}
If you need to generate unique filenames completely automatically, you can develop your own interceptor, which adds a UUID header to each event.
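A possible sketch of such an interceptor (the class name and header name are only examples):
import java.util.List;
import java.util.UUID;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class UuidInterceptor implements Interceptor {

    @Override
    public void initialize() { }

    @Override
    public Event intercept(Event event) {
        // stamp a unique id header on the event for use in the sink configuration
        event.getHeaders().put("uuid", UUID.randomUUID().toString());
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            intercept(event);
        }
        return events;
    }

    @Override
    public void close() { }

    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new UuidInterceptor();
        }

        @Override
        public void configure(Context context) { }
    }
}
It would then be wired in through the source's interceptors configuration and referenced from hdfs.filePrefix in the same way as the host header above.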
I have a customer who FTPs a file over to our server. I have a route defined to select certain files from this directory and move them to a different directory to be processed. The problem is that the route picks the file up as soon as it sees it and doesn't wait until the FTP transfer is complete. The result is a 0-byte file in the path described in the to URI. I have tried each of the readLock options (markerFile, rename, changed, fileLock) but none have worked. I am using the Spring DSL to define my Camel routes. Here is an example of one that is not working. The Camel version is 2.10.0.
<route>
<from uri="file:pathName?initialDelay=10s&amp;move=ARCHIVE&amp;sortBy=ignoreCase:file:name&amp;readLock=fileLock&amp;readLockCheckInterval=5000&amp;readLockTimeout=10m&amp;filter=#FileFilter" />
<to uri="file:pathName/newDirectory/" />
</route>
Any help would be appreciated. Thanks!
Just to note: at one point this route was running on a different server and I had to FTP the file to another server that processed it. When I was using the FTP component in Camel, that route worked fine; that is, it did wait until the file was received before doing the FTP. I had the same options defined on my route. That's why I am thinking there should be a way to do it, since the FTP component uses the file component options in Camel.
I am taking @PeteH's suggestion #2 and did the following. I am still hoping there is another way, but this will work.
I added the following method, which returns a Date that is the current time minus x seconds:
public static Date getDateMinusSeconds(Integer seconds) {
    Calendar cal = Calendar.getInstance();
    // pass a negative value to get a timestamp in the past
    cal.add(Calendar.SECOND, seconds);
    return cal.getTime();
}
Then within my filter I check whether the initial filtering is true. If it is, I compare the last-modified date to getDateMinusSeconds(), and I return false from the filter if the comparison is true.
if (filter) {
    if (new Date(pathname.getLastModified()).after(DateUtil.getDateMinusSeconds(-30))) {
        return false;
    }
}
I have not done any of this in your environment, but have had this kind of problem before with FTP.
The better option of the two I can suggest is if you can get the customer to send two files. File1 is their data, File2 can be anything. They send them sequentially. You trap when File2 arrives, but all you're doing is using it as a "signal" that File1 has arrived safely.
The less good option (and this is the one we ended up implementing because we couldn't control the files being sent) is to write your code such that you refuse to process any file until its last modified timestamp is at least x minutes old. I think we settled on 5 minutes. This is pretty horrible since you're essentially firing, checking, sleeping, checking etc. etc.
But the problem you describe is quite well known with FTP. Like I say, I don't know whether either of these approaches will work in your environment, but certainly at a high level they're sound.
The Camel FTP component inherits its options from the file component. This is at the top of the documentation, describing this very thing:
Beware that the JDK File IO API is a bit limited in detecting whether another application is currently writing/copying a file. And the implementation can be different depending on the OS platform as well. This could lead to Camel thinking the file is not locked by another process and starting to consume it. Therefore you have to do your own investigation into what suits your environment. To help with this, Camel provides different readLock options and the doneFileName option that you can use. See also the section about consuming files from folders where others drop files directly.
To get around this problem I had my publishers put out a "done" file, which solved the problem.
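For illustration, a minimal Java DSL sketch of that approach (inside a RouteBuilder's configure()) using Camel's doneFileName option, assuming the publisher uploads a marker file named ready into the same directory once the data file is complete; the paths and the marker name are placeholders:
// only consume files from pathName once the "ready" marker file is present
from("file:pathName?doneFileName=ready&move=ARCHIVE&filter=#FileFilter")
    .to("file:pathName/newDirectory/");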
A way to do this is to use a watcher which will trigger the job once a file is deposited, and to delay the consuming of the file by a significant amount of time, to be sure that its upload is finished.
from("file-watch://{{ftp.file_input}}?events=CREATE&recursive=false")
.id("FILE_WATCHER")
.log("File event: ${header.CamelFileEventType} occurred on file ${header.CamelFileName} at ${header.CamelFileLastModified}")
.delay(20000)
.to("direct:file_processor");
from("direct:file_processor")
.id("FILE_DISPATCHER")
.log("Sending To SFTP Uploader")
.to("sftp://{{ftp.user}}#{{ftp.host}}:{{ftp.port}}//upload?password={{ftp.password}}&fileName={{file_pattern}}-${date:now:yyyyMMdd-HH:mm}.csv")
.log("File sent to SFTP");
It's never too late to respond.
Hope it can help someone struggling in the deepest, creepiest places of the SFTP world...
I will read 2000 files and do some work on them with Java, so I think I should use batch processing. But how could I do that? My system is Windows 7.
You can use Apache Camel / ServiceMix ESB in combination with ActiveMQ.
Your first step would be to write the file names one by one into ActiveMQ messages. This could be done in one so-called route (a separate thread run automatically by the framework). Here you have several options for which component to use: there is a file component which reads files and moves them to a done folder afterwards, or you can use a simple Java bean.
In a second route you read the ActiveMQ messages (a single consumer if it is important to process the files in sequence, or multiple consumers if you want more performance) and process the file content in a processor or Java bean however you want.
You can stop the Camel context at any time you want (during the processing) and restart it afterwards; processing picks up at the next file not yet processed by consuming it from the ActiveMQ message queue.
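A rough sketch of those two routes in the Camel Java DSL; the input directory, queue name and the FileProcessor bean are placeholders:
import org.apache.camel.builder.RouteBuilder;

public class BatchRoutes extends RouteBuilder {

    @Override
    public void configure() {
        // route 1: pick up files one by one and publish their locations to ActiveMQ
        from("file:/data/input?move=.done")
            .setBody(header("CamelFileAbsolutePath"))
            .to("activemq:queue:files.to.process");

        // route 2: consume the file locations and do the actual work;
        // keep concurrentConsumers=1 to preserve ordering, raise it for throughput
        from("activemq:queue:files.to.process?concurrentConsumers=1")
            .bean(FileProcessor.class, "process");
    }
}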
Java does not provide built-in support for batch processing. You need to use something like Spring Batch.
Check this out:
http://jcp.org/en/jsr/detail?id=352
This is the new Batch JSR, javax.batch.
You can't read files as a batch; you have to read them one at a time. You can use more than one thread, but I would write it single-threaded first.
It doesn't matter what OS you are using.
Assuming you have the ability to work on one file, you have two options: use a list of files, or recurse through a directory. It gets trickier if you need to roll back changes as a result of something that happens towards the end, though. You'd have to create a list of changes to make and then commit them all at the end of the batch operation.
// first option: process an explicit collection of files
void batchProcess(Collection<File> filesToProcess) {
    for (File file : filesToProcess) {
        processSingle(file);
    }
}

// second option: recurse through a directory tree
void batchProcess(File file) {
    if (file.isDirectory()) {
        for (File child : file.listFiles()) {
            batchProcess(child); // recurse into the child, not the parent
        }
    } else {
        processSingle(file);
    }
}