Java batch processing

I need to read 2000 files and do some work on them in Java, so I think I should use batch processing. But how would I do that? My system is Windows 7.

You can use Apache Camel / ServiceMix ESB in combination with ActiveMQ.
Your first step would be to write the file names one by one into ActiveMQ messages. This can be done in one so-called route (a separate thread managed automatically by the framework). Here you have several options for which component to use: there is a file component which reads files and moves them to a done directory afterwards, or you can use a simple Java bean.
In a second route you read the ActiveMQ messages (a single consumer if it is important to process the files in sequence, or multiple consumers if you want more performance) and process the file content in a processor or Java bean however you want.
You can stop the Camel context at any time during processing and restart it afterwards; processing resumes at the next file not yet processed, by consuming it from the ActiveMQ message queue.
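A minimal sketch of those two routes, assuming an ActiveMQ component registered as "activemq" and a local input directory:

import org.apache.camel.Exchange;
import org.apache.camel.builder.RouteBuilder;

public class FileToQueueRoutes extends RouteBuilder {
    @Override
    public void configure() {
        // Route 1: pick up files one by one, move each to a done directory,
        // and publish its name as a queue message.
        from("file:input?move=done")
            .setBody().header(Exchange.FILE_NAME)
            .to("activemq:queue:files");

        // Route 2: consume the file names; a single consumer preserves the
        // sequence, while concurrentConsumers=n would add parallelism.
        from("activemq:queue:files")
            .process(exchange -> {
                String fileName = exchange.getIn().getBody(String.class);
                // do the actual work on the file identified by fileName
            });
    }
}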

Java does not provide built-in support for batch processing. You need to use something like Spring Batch.

Check this out:
http://jcp.org/en/jsr/detail?id=352
This is the new batch processing JSR (javax.batch): JSR 352.

You can't read the files as a single batch; you have to read them one at a time. You can use more than one thread, but I would write it single-threaded first.
It doesn't matter which OS you are using.

Assuming you have the ability to work on one file, you have two options: use a file list, or recurse through a directory. It gets trickier if you need to roll back changes as a result of something that happens towards the end, though: you'd have to build a list of changes to make and then commit them all at the end of the batch operation.
// first option: process an explicit collection of files
void batchProcess(Collection<File> filesToProcess) {
    for (File file : filesToProcess) {
        processSingle(file);
    }
}

// second option: recurse through a directory tree
void batchProcess(File file) {
    if (file.isDirectory()) {
        for (File child : file.listFiles()) {
            batchProcess(child); // recurse on the child, not the parent
        }
    } else {
        processSingle(file);
    }
}
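For example, kicking off the recursive variant (the path is illustrative):

batchProcess(new File("C:\\data\\input"));

Note that File.listFiles() can return null (e.g. on an I/O error), so a production version should guard against that before looping.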

Related

Camel FTP recursive consumer taking too long

I am trying to retrieve files from only certain subdirectories on an FTP server. For example, I want to poll for files under only subdirectories A and C.
ROOT_DIR/A/test1.xml
ROOT_DIR/B/test2.xml
ROOT_DIR/C/test3.xml
ROOT_DIR/..(there are hundreds of subdirs)
I'm trying to avoid having an endpoint for each directory since many more directories to consume from may be added in the future.
I have successfully implemented this using a single SFTP endpoint on ROOT_DIR with recursive=true in conjunction with an AntPathMatcherGenericFileFilter instance as suggested.
The issue I'm having is that every subdirectory is being searched (hundreds of them), and my filter is also looking for certain filenames only. The result is only filtered after every directory has been searched, and this is taking far too long (minutes).
Is there any way to only consume from certain subdirectories that could be maintained in a properties file without searching every subdirectory?
I did find one possible solution using a different approach: a timer-based polling consumer with the Ant filter. With this solution, a dynamic SFTP endpoint (or a list of dynamic SFTP endpoints) can be consumed from using a ConsumerTemplate inside a bean, instead of doing so within a route.
while (true) {
    // receive the message from the queue, wait at most 3 sec
    String msg = consumer.receiveBody("activemq:queue.inbox", 3000, String.class);
    // ... process msg ...
}
This can then be used with my existing Ant filter to select only certain files under a dynamic list of subdirectories to consume from.
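Along those lines, a minimal sketch of polling a configurable list of subdirectories with a ConsumerTemplate (the directory names, host, and filter bean name are assumptions):

// consumer is a ConsumerTemplate and producer a ProducerTemplate,
// both obtained from the CamelContext
List<String> subDirs = Arrays.asList("A", "C"); // e.g. loaded from a properties file
for (String dir : subDirs) {
    String uri = "sftp://user@host/ROOT_DIR/" + dir + "?password=secret&filter=#antFilter";
    Exchange exchange = consumer.receive(uri, 3000); // wait at most 3 sec per directory
    if (exchange != null) {
        producer.send("direct:process", exchange); // hand off to a processing route
    }
}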

spring batch structure for parallel processing

I am seeking some guidance on how to structure a Spring Batch application to ingest a bunch of potentially large delimited files, each with a different format.
The requirements are clear:
select the files to ingest from an external source: there can be multiple releases of some files each day, so the latest release must be picked
turn each line of each file into JSON by combining the delimited fields with the column names from the first line (which is skipped)
send each line of JSON to a RESTful API
We have one step which uses a MultiResourceItemReader to process the files in sequence. The files are input streams, which time out.
Ideally I think we want to have
a step which identifies the files to ingest
a step which processes files in parallel
Thanks in advance.
This is a fun one. I'd implement a custom line tokenizer that extends DelimitedLineTokenizer and also implements LineCallbackHandler. I'd then configure your FlatFileItemReader to skip the first line (the list of column names) and pass that first line to your handler/tokenizer to set all the token names.
A custom FieldSetMapper would then receive a FieldSet with all your name/value pairs, which I'd just pass to the ItemProcessor. Your processor could then build your JSON strings and pass them off to your writer.
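A minimal sketch of that tokenizer/handler combination and its wiring (the class name is mine; the setup assumes the standard FlatFileItemReader API):

import org.springframework.batch.item.file.LineCallbackHandler;
import org.springframework.batch.item.file.transform.DelimitedLineTokenizer;

// Doubles as the LineCallbackHandler that receives the skipped header
// line and uses its column names as the token names.
public class HeaderAwareTokenizer extends DelimitedLineTokenizer
        implements LineCallbackHandler {
    @Override
    public void handleLine(String headerLine) {
        setNames(headerLine.split(","));
    }
}

The reader then skips the header but routes it to the tokenizer:

FlatFileItemReader<FieldSet> reader = new FlatFileItemReader<>();
HeaderAwareTokenizer tokenizer = new HeaderAwareTokenizer();
reader.setLinesToSkip(1);                  // skip the header row...
reader.setSkippedLinesCallback(tokenizer); // ...but hand it to the tokenizer
DefaultLineMapper<FieldSet> lineMapper = new DefaultLineMapper<>();
lineMapper.setLineTokenizer(tokenizer);
lineMapper.setFieldSetMapper(new PassThroughFieldSetMapper());
reader.setLineMapper(lineMapper);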
Obviously, your job falls into the typical reader -> processor -> writer category, with the writer being optional in your case (if you don't wish to persist the JSON before sending it to the RESTful API); alternatively, the writer itself can send the JSON to the REST service and complete once it receives the response.
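A minimal sketch of such a writer, assuming Spring's RestTemplate and a hypothetical endpoint URL:

import java.util.List;
import org.springframework.batch.item.ItemWriter;
import org.springframework.web.client.RestTemplate;

public class RestPostingWriter implements ItemWriter<String> {
    private final RestTemplate rest = new RestTemplate();

    @Override
    public void write(List<? extends String> jsonLines) {
        for (String json : jsonLines) {
            // POST each JSON line; the endpoint URL is an assumption
            rest.postForEntity("http://example.com/api/ingest", json, String.class);
        }
    }
}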
Anyway, you don't need a separate step just to identify the file names; make that part of the application initialization code.
Strategies to parallelize your application are listed in the Spring Batch scaling documentation.
You just said a bunch of files. If the files have a similar number of lines, I would go with the partitioning approach (sketched below): by implementing the Partitioner interface, you hand each file to a separate thread, and that thread executes its own reader -> processor -> writer step. You wouldn't need MultiResourceItemReader in this case, but a simple single-file reader, since each file gets its own reader. See Partitioning.
If the line counts vary a lot, i.e. one file will take hours while another finishes in a few minutes, you can continue using MultiResourceItemReader but use a multi-threaded step to achieve parallelism. This is chunk-level parallelism, so you might have to make your reader thread-safe.
The parallel-steps approach doesn't look suitable for your case, since your steps are not independent.
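For the partitioning approach, a minimal sketch (the file pattern and step names are assumptions; MultiResourcePartitioner creates one partition per resource):

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.support.MultiResourcePartitioner;
import org.springframework.core.io.Resource;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

public Step masterStep(StepBuilderFactory steps, Step workerStep) throws Exception {
    Resource[] files = new PathMatchingResourcePatternResolver()
            .getResources("file:/data/in/*.csv"); // one partition per file
    MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
    partitioner.setResources(files);
    return steps.get("masterStep")
            .partitioner("workerStep", partitioner)
            .step(workerStep)            // each partition runs reader -> processor -> writer
            .gridSize(files.length)      // one worker per file
            .taskExecutor(new SimpleAsyncTaskExecutor())
            .build();
}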
Hope it helps!

How do I configure Apache Camel to sequentially execute processes based on a trigger file?

I have a situation where an external system will send me 4 different files at the same time. Let's call them the following:
customers.xml (optional)
addresses.xml (optional)
references.xml (optional)
activity.xml (trigger file)
When the trigger file is sent and picked up by Camel, Camel should then look to see if file #1 exists; if it does, process it; if it doesn't, move on to files #2 and #3, applying the same if/then logic. Once that logic has been performed, it can proceed with file #4.
I found elements like onCompletion and checks for whether the body is null, but if someone has a much better idea, I would greatly appreciate it.
As I thought about this further, it turned out this was more of a sequencing problem. The key here is that I receive the files in batches, at the same time. That being said, I created a pluggable CustomComparator.
Once I created my CustomComparator class to order my files into the desired ArrayList index positions, I was able to route the messages in the order I wanted, as sketched below.
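A minimal sketch of such a comparator, plugged into the file endpoint's sorter option (the class name and ranking are my assumptions, based on the file list above):

import java.util.Comparator;
import org.apache.camel.component.file.GenericFile;

// Forces the trigger file to sort last so the optional files are handled first.
public class TriggerLastComparator implements Comparator<GenericFile<?>> {
    @Override
    public int compare(GenericFile<?> a, GenericFile<?> b) {
        return Integer.compare(rank(a.getFileName()), rank(b.getFileName()));
    }

    private int rank(String name) {
        switch (name) {
            case "customers.xml":  return 0;
            case "addresses.xml":  return 1;
            case "references.xml": return 2;
            case "activity.xml":   return 3; // trigger file goes last
            default:               return 4;
        }
    }
}

The endpoint would then reference it, e.g. from("file:inbox?sorter=#triggerLastComparator").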

Camel route picking up file before ftp is complete

I have a customer who FTPs a file over to our server. I have a route defined to select certain files from this directory and move them to a different directory to be processed. The problem is that the route takes the file as soon as it sees it and doesn't wait until the FTP transfer is complete. The result is a 0-byte file in the path described in the to URI. I have tried each of the readLock options (markerFile, rename, changed, fileLock) but none have worked. I am using the Spring DSL to define my Camel routes. Here is an example of one that is not working. The Camel version is 2.10.0.
<route>
    <from uri="file:pathName?initialDelay=10s&amp;move=ARCHIVE&amp;sortBy=ignoreCase:file:name&amp;readLock=fileLock&amp;readLockCheckInterval=5000&amp;readLockTimeout=10m&amp;filter=#FileFilter" />
    <to uri="file:pathName/newDirectory/" />
</route>
Any help would be appreciated. Thanks!
Just to note: at one point this route was running on a different server and I had to FTP the file to another server that processed it. When I was using the FTP component in Camel, that route worked fine; it did wait until the file was fully received before doing the FTP. I had the same options defined on my route. That's why I am thinking there should be a way to do it, since the FTP component uses the file component options in Camel.
I took @PeteH's second suggestion below and did the following. I am still hoping there is another way, but this will work.
I added the following method, which returns a Date that is the current time minus x seconds:
public static Date getDateMinusSeconds(Integer seconds) {
    Calendar cal = Calendar.getInstance();
    cal.add(Calendar.SECOND, seconds); // pass a negative value to subtract
    return cal.getTime();
}
Then within my filter I check whether the initial filtering passes. If it does, I compare the file's last-modified date to getDateMinusSeconds() and return false from the filter if the file is too new:
if (filter) {
    if (new Date(pathname.getLastModified()).after(DateUtil.getDateMinusSeconds(-30))) {
        return false; // modified within the last 30 seconds; skip it for now
    }
}
I have not done any of this in your environment, but have had this kind of problem before with FTP.
The better option of the two I can suggest is if you can get the customer to send two files. File1 is their data, File2 can be anything. They send them sequentially. You trap when File2 arrives, but all you're doing is using it as a "signal" that File1 has arrived safely.
The less good option (and this is the one we ended up implementing because we couldn't control the files being sent) is to write your code such that you refuse to process any file until its last modified timestamp is at least x minutes old. I think we settled on 5 minutes. This is pretty horrible since you're essentially firing, checking, sleeping, checking etc. etc.
But the problem you describe is quite well known with FTP. Like I say, I don't know whether either of these approaches will work in your environment, but certainly at a high level they're sound.
The Camel FTP component inherits from the file component, whose documentation describes this very thing right at the top:
Beware that the JDK File IO API is a bit limited in detecting whether another application is currently writing/copying a file, and the implementation can differ depending on the OS platform as well. This could lead to Camel thinking the file is not locked by another process and starting to consume it. Therefore you have to do your own investigation of what suits your environment. To help with this, Camel provides different readLock options and a doneFileName option that you can use. See also the section on consuming files from folders where others drop files directly.
To get around this problem I had my publishers put out a "done" file; this solved the problem.
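With the doneFileName option mentioned in that documentation, the consumer only picks a file up once the matching done file exists (directory names are illustrative):

from("file:pathName?doneFileName=${file:name}.done")
    .to("file:pathName/newDirectory/");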
Another way to do this is to use a watcher which triggers the job once a file is deposited, and to delay consuming the file by a significant amount of time, to be sure that its upload has finished.
from("file-watch://{{ftp.file_input}}?events=CREATE&recursive=false")
.id("FILE_WATCHER")
.log("File event: ${header.CamelFileEventType} occurred on file ${header.CamelFileName} at ${header.CamelFileLastModified}")
.delay(20000)
.to("direct:file_processor");
from("direct:file_processor")
.id("FILE_DISPATCHER")
.log("Sending To SFTP Uploader")
.to("sftp://{{ftp.user}}#{{ftp.host}}:{{ftp.port}}//upload?password={{ftp.password}}&fileName={{file_pattern}}-${date:now:yyyyMMdd-HH:mm}.csv")
.log("File sent to SFTP");
It's never too late to respond.
Hope it can help someone struggling in the deepest, creepiest places of the SFTP world...

Queueing Multiple Downloads, looking for a producer consumer API

I have an application (a servlet, but that's not very important) that downloads a set of files and parses them to extract information. Up to now, I did those operations in a loop:
- fetching a new file from the Internet
- analyzing it.
A multi-threaded download manager seems a better solution for this problem, and I would like to implement it in the fastest way possible.
Some of the downloads are dependent on others (so this set is partially ordered).
Multi-threaded programming is hard, and if I could find an API for it I would be quite happy. I need to put a group of files (ordered) in a queue and get the first group of files that is completely downloaded.
Do you know of any library I could use to achieve that ?
Regards,
Stéphane
You could do something like:
BlockingQueue<Download> queue = new LinkedBlockingQueue<>(); // BlockingQueue is an interface; pick an implementation
ExecutorService pool = Executors.newFixedThreadPool(5);
Download obj = new Download(queue); // a Runnable that enqueues itself once finished
pool.execute(obj); // start the download; it is placed on the queue once completed
Download data = queue.take(); // blocks until a completely downloaded item is available
You may have to use a different kind of queue if the speed of each download is not the same; a LinkedBlockingQueue is first in, first out.
You may want to look into using a PriorityBlockingQueue, which orders the Download objects according to their Comparable implementation or a supplied Comparator. See the API for more details.
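For example, a short sketch ordering completed downloads by a priority field (the Download class and its getPriority() method are assumptions):

BlockingQueue<Download> queue = new PriorityBlockingQueue<>(
        11, Comparator.comparingInt(Download::getPriority)); // 11 is the default initial capacity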
Hope this helps.
