I am using Spring Batch to read from a CSV file and write the lines on the screen.
My job is composed of 3 parts:
Part 1: Verify that the CSV file exists in an INPUT directory on my disk; if it does, the file is moved to another directory called PROD.
Part 2: Extract data from the CSV file using a FlatFileItemReader.
Part 3: Write all the items to the screen.
The problem is that the FlatFileItemReader throws org.springframework.batch.item.ItemStreamException: Failed to initialize the reader, caused by java.lang.IllegalArgumentException: Input resource must be set.
Here is my code:
@Bean
public FlatFileItemReader<UniversInvestissement> reader() {
    FlatFileItemReader<UniversInvestissement> reader = new FlatFileItemReader<>();
    File csvFile = new File("C://INPUT/data.csv");
    Resource resource = resourceLoader.getResource("file:" + csvFile.getAbsolutePath());
    reader.setLinesToSkip(1);
    reader.setResource(resource);

    DefaultLineMapper lineMapper = new DefaultLineMapper();
    DelimitedLineTokenizer tokenizer = new DelimitedLineTokenizer();
    tokenizer.setNames(new String[]{"COL1", "COL2", "COL3", "COL4"});
    tokenizer.setDelimiter(";");

    FieldSetMapper fieldSetMapper = new UniversInvestissementFieldSetMapper();
    lineMapper.setLineTokenizer(tokenizer);
    lineMapper.setFieldSetMapper(fieldSetMapper);

    reader.setLineMapper(lineMapper);
    reader.setEncoding("Cp1252");
    return reader;
}
@Bean
public UniversInvestissementWriter writer() {
    return new UniversInvestissementWriter();
}

@Bean
public UniversInvestissementProcessor processor() {
    return new UniversInvestissementProcessor();
}
@Bean
public Step extractData() {
    return steps.get("extractData")
            .<UniversInvestissement, UniversInvestissementProcessorResult>chunk(1)
            .reader(reader())
            .processor(processor())
            .writer(writer())
            .build();
}
Actually the problem is that when the FlatFileItemReader is initialized it can't find the CSV file as a resource!
Is there a way to postpone the resource assignment and avoid this exception?
You can use reader.setStrict(false); if you set strict mode to false, the reader will not throw an exception when the input resource does not exist at initialization time. You might also have to use @StepScope to make the reader lazy; a sketch is below. I am using the same setup and it's working fine for me. Hope this helps.
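A minimal sketch of that combination, reusing the names from the question (the C:/PROD/data.csv path is illustrative, since by the time this step runs the file is supposed to have been moved to PROD):

@Bean
@StepScope // the reader bean is created when the step runs, not at context startup
public FlatFileItemReader<UniversInvestissement> reader() {
    FlatFileItemReader<UniversInvestissement> reader = new FlatFileItemReader<>();
    // strict = false: open() logs a warning instead of throwing if the file is absent
    reader.setStrict(false);
    reader.setResource(new FileSystemResource("C:/PROD/data.csv"));
    reader.setLinesToSkip(1);
    reader.setEncoding("Cp1252");

    DelimitedLineTokenizer tokenizer = new DelimitedLineTokenizer();
    tokenizer.setDelimiter(";");
    tokenizer.setNames(new String[]{"COL1", "COL2", "COL3", "COL4"});

    DefaultLineMapper lineMapper = new DefaultLineMapper();
    lineMapper.setLineTokenizer(tokenizer);
    lineMapper.setFieldSetMapper(new UniversInvestissementFieldSetMapper());
    reader.setLineMapper(lineMapper);
    return reader;
}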
Verify that the CSV file exists in an INPUT directory on my disk; if it does, the file is moved to another directory called PROD
This problem can easily be solved using a JobExecutionDecider:
class Checker implements JobExecutionDecider {

    @Override
    public FlowExecutionStatus decide(JobExecution jobExecution, StepExecution stepExecution) {
        File input = new File("C://INPUT/data.csv");
        if (!input.exists()) {
            // file not found in INPUT/ dir
            return FlowExecutionStatus.STOPPED;
        }
        if (!input.renameTo(new File("C://PROD/data.csv"))) {
            // moving the file from INPUT/ to PROD/ failed
            return FlowExecutionStatus.FAILED;
        }
        return FlowExecutionStatus.COMPLETED;
    }
}
Of course, extractData() must be wired into a programmatic flow decision (check here for a simple example); a rough sketch of that wiring follows.
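A minimal sketch, assuming the Checker above is exposed as a bean (checker()), a JobBuilderFactory named jobs is injected, and extractData() is the step from the question:

@Bean
public Job job() {
    Flow flow = new FlowBuilder<Flow>("csvFlow")
            .start(checker())                          // the JobExecutionDecider above
            .on("COMPLETED").to(extractData())         // file found and moved: read and print it
            .from(checker()).on("STOPPED").stop()      // no file in INPUT/: stop the job quietly
            .from(checker()).on("FAILED").fail()       // move to PROD/ failed: mark the job as failed
            .build();

    return jobs.get("job")
            .start(flow)
            .end()
            .build();
}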
I think the problem is in your resourceLoader, because that exception is thrown by a non-null assertion on the resource instance. So your resourceLoader is returning a null value.
Try to use a FileSystemResource, without any resource loader. For example:
reader.setResource(new FileSystemResource(csvFile));
Related
I'm new to Spring Integration and I have a project where I want to process the contents of a zip file. The zip contains a number of tiff files and one xml file. The xml file contains information about how many tiff files should be in the zip, so I need to compare the number of tiff files with the info in the xml file.
My process is: poll a directory for a zip file, move the zip file to a "working" directory, unzip the file, find the xml file and read its contents, count the tiff files and confirm that the names and number of files match the xml data.
I have all of the steps working except for the stage where I try to read the zip and tiff files. My problem is that the UnZipTransformer creates an intermediate directory with a uuid-format name below the working directory, and I'm unable to work out how to get the files from this directory.
Directory structure after the unzip has happened:
working/
    0eca3f50-eedb-9ebd-5a3a-4ceb3ad8abbd/
        01.tif
        02.tif
        my.xml
This Flow works. It extracts the contents of the zip file.
@Configuration
public class FileUnzipIntegrationFlow {

    public static final String UNZIP_OUTPUT = "unzipOutputChannel";

    @Value("${unzipped.dir}")
    String unzippedDir;

    @Value("${working.dir}")
    String workingDir;

    @Bean
    public MessageSource<File> unzippedDirectory() {
        FileReadingMessageSource messageSource = new FileReadingMessageSource();
        messageSource.setDirectory(new File(unzippedDir));
        return messageSource;
    }

    @Bean
    public IntegrationFlow fileUnzipper() {
        return IntegrationFlows.from(unzippedDirectory(), c -> c.poller(Pollers.fixedDelay(1000)))
                .filter(source -> ((File) source).getName().endsWith(".zip"))
                .transform(unZipTransformer())
                .log()
                .get();
    }

    @Bean(name = UNZIP_OUTPUT)
    public SubscribableChannel unzipOutputChannel() {
        return MessageChannels.publishSubscribe(UNZIP_OUTPUT)
                .get();
    }

    @Bean
    @Transformer(inputChannel = "input", outputChannel = UNZIP_OUTPUT)
    public UnZipTransformer unZipTransformer() {
        UnZipTransformer unZipTransformer = new UnZipTransformer();
        unZipTransformer.setExpectSingleResult(false);
        unZipTransformer.setZipResultType(ZipResultType.FILE);
        unZipTransformer.setWorkDirectory(new File(workingDir));
        unZipTransformer.setDeleteFiles(false);
        return unZipTransformer;
    }
}
I can't work out how to get to the sub-directory in this Flow
import static com.santanderuk.spring.integration.FileUnzipIntegrationFlow.UNZIP_OUTPUT;

@Configuration
public class XmlVerificationFlow {

    @Value("${working.dir}")
    String workingDir;

    @Bean
    public IntegrationFlow xmlVerfier() {
        return IntegrationFlows.from(xmlWorkingDirectory(),
                        c -> c.poller(Pollers.fixedRate(1000).maxMessagesPerPoll(1)))
                .filter(source -> ((File) source).getName().toLowerCase().endsWith(".xml"))
                .handle(xmlFileHandler())
                .get();
    }

    @Bean
    public MessageSource<File> workingDirectory() {
        FileReadingMessageSource messageSource = new FileReadingMessageSource();
        messageSource.setDirectory(new File(workingDir));
        return messageSource;
    }
The snippet above only works when I manually move the xml file from the sub-directory into the working directory. I can also see the payload value in the logging, which has the directory name I need, but I have not been able to find out how to access this information:
2022-03-28 19:39:14.395 INFO 7588 --- [ scheduling-1] o.s.integration.handler.LoggingHandler : GenericMessage [payload={10000001.tif=processing\working\1cd8f803-2e45-dfe2-1c39-99b0d74f83f0\10000001.tif, 10000002.tif=processing\working\1cd8f803-2e45-dfe2-1c39-99b0d74f83f0\10000002.tif, 10000003.tif=processing\working\1cd8f803-2e45-dfe2-1c39-99b0d74f83f0\10000003.tif, 10000004.tif=processing\working\1cd8f803-2e45-dfe2-1c39-99b0d74f83f0\10000004.tif, 10000005.tif=processing\working\1cd8f803-2e45-dfe2-1c39-99b0d74f83f0\10000005.tif, 10000006.tif=processing\working\1cd8f803-2e45-dfe2-1c39-99b0d74f83f0\10000006.tif, 10000007.tif=processing\working\1cd8f803-2e45-dfe2-1c39-99b0d74f83f0\10000007.tif, 10000008.tif=processing\working\1cd8f803-2e45-dfe2-1c39-99b0d74f83f0\10000008.tif, 10000009.tif=processing\working\1cd8f803-2e45-dfe2-1c39-99b0d74f83f0\10000009.tif, 10000010.tif=processing\working\1cd8f803-2e45-dfe2-1c39-99b0d74f83f0\10000010.tif, 20220304S092800.XML=processing\working\1cd8f803-2e45-dfe2-1c39-99b0d74f83f0\20220304S092800.XML}, headers={file_originalFile=processing\unzipped\202203040001.zip, id=2835eb9e-ff3b-71bf-7432-4967a1f808f6, file_name=202203040001.zip, file_relativePath=202203040001.zip, timestamp=1648492754392}]
It looks like you are just ignoring the power of the UnZipTransformer. It returns something like this for us:
final SortedMap<String, Object> uncompressedData = new TreeMap<>();
which, in your case, has content like this:
uncompressedData.put(zipEntryName, destinationFile);
So you don't need that extra flow to poll the working dir when you can work easily with the result of unzipping.
On the other hand, the FileReadingMessageSource can be configured with a RecursiveDirectoryScanner to let it iterate sub-directories recursively. By default it scans only the top directory and ignores all the sub-directories. For example:
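A minimal sketch of the scanner option, reusing the workingDirectory() bean from the question (the scanner class is org.springframework.integration.file.RecursiveDirectoryScanner; everything else is unchanged):

@Bean
public MessageSource<File> workingDirectory() {
    FileReadingMessageSource messageSource = new FileReadingMessageSource();
    messageSource.setDirectory(new File(workingDir));
    // walk sub-directories (such as the uuid-named one) instead of only the top directory
    messageSource.setScanner(new RecursiveDirectoryScanner());
    return messageSource;
}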
Thanks very much Artem. As suggested, I used the SortedMap that was returned by the UnZipTransformer in a ZipOutputHandler to count the tiff files and record the xml file name.
Updated Flow
@Bean
public IntegrationFlow fileUnzipper() {
    return IntegrationFlows.from(unzippedDirectory(), c -> c.poller(Pollers.fixedDelay(1000)))
            .filter(source -> ((File) source).getName().endsWith(".zip"))
            .transform(unZipTransformer())
            .handle(zipOutputHandler(), "process")
            .log()
            .get();
}
process method in the Handler
public File process(SortedMap<String, Object> uncompressedData) {
    uncompressedData.forEach((s, o) -> {
        if (s.endsWith("tif")) {
            tiffCount++;
            tiffNames.add(s);
        }
        if (s.endsWith("XML")) {
            xmlName = s;
            xmlFile = new File(o.toString());
            extractedDirectory = xmlFile.getParentFile();
        }
    });
I have a folder with thousands of text files with JSON content that I need to read, convert into a POJO and then save into a MySQL database. I intend to use a Spring Batch application.
Here is the issue: so far, the research I have done only shows how to read multiple CSV or XML files, not JSON data. Specifically, I need to convert this method, which parses a CSV file, into a JSON parser.
@Bean
public FlatFileItemReader<Person> reader() {
    FlatFileItemReader<Person> reader = new FlatFileItemReader<Person>();
    reader.setLineMapper(new DefaultLineMapper<Person>() {{
        setLineTokenizer(new DelimitedLineTokenizer() {{
            setNames(new String[] {"firstname", "lastname", "email", "age"});
        }});
        setFieldSetMapper(new BeanWrapperFieldSetMapper<Person>() {{
            setTargetType(Person.class);
        }});
    }});
    return reader;
}
This code parses a JSON file:
JSONParser parser = new JSONParser();
Object obj = parser.parse(new FileReader("C:\\path\\sample.json"));
The method might be something like this:
@Bean
public FileReader<Person> reader() {
    FileReader<Person> reader = new FileReader<Person>();
    /**** need help on what to do here ****/
    return reader;
}
Also, since I am reading all the files in a directory, I am passing the value of that directory in this format:
@Value(value="C:\\path\\*.json")
private Resource[] resources;
So I need help on how to use this value (the directory for all the files) instead of what I showed earlier (a single file location):
Object obj = parser.parse(new FileReader("C:\\path\\sample.json"));
You can use the MultiResourceItemReader with a JsonItemReader as delegate. Here is a quick example:
@Bean
public MultiResourceItemReader<Person> multiResourceItemReader(Resource[] resources) {
    JsonItemReader<Person> delegate = new JsonItemReaderBuilder<Person>()
            .jsonObjectReader(new JacksonJsonObjectReader<>(Person.class))
            .name("personItemReader")
            .build();
    MultiResourceItemReader<Person> reader = new MultiResourceItemReader<Person>();
    reader.setDelegate(delegate);
    reader.setResources(resources);
    return reader;
}
You can find more details about the JsonItemReader in the reference documentation.
Hope this helps.
https://docs.spring.io/spring-batch/4.0.x/reference/html/readersAndWriters.html#multiFileInput - I only found this after I found this answer
My current usage scenario is as follows:
using FlatFileItemReader to read a .txt input file line by line
using ItemProcessor to process each line and invoke a remote service over HTTP
using FlatFileItemWriter to write the result of each request into a file
I would like the remote calls made by the ItemProcessor in step 2 to run in multiple threads.
The main flow code is like below (with Spring Boot):
//read data
FlatFileItemReader<ItemProcessing> reader = read(batchReqRun);
//process data
ItemProcessor<ItemProcessing, ItemProcessing> processor = process(batchReqDef);
//write data
File localOutputDir = new File(localStoragePath + "/batch-results");
File localOutputFile = new File(localOutputDir, batchReqExec.getDestFile());
FlatFileItemWriter<ItemProcessing> writer = write(localOutputDir, localOutputFile);

StepExecutionListener stepExecListener = new StepExecutionListener() {
    @Override
    public void beforeStep(StepExecution stepExecution) {
        logger.info("Job {} step start {}", stepExecution.getJobExecutionId(), stepExecution.getStepName());
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        logger.info("Job {} step end {}", stepExecution.getJobExecutionId(), stepExecution.getStepName());
        //.......ignore some code
        return finalStatus;
    }
};

Tasklet resultFileTasklet = new BatchFileResultTasklet(localOutputFile, httpClientService);
TaskletStep resultFileStep = stepBuilder.get("result")
        .tasklet(resultFileTasklet)
        .listener(stepExecListener)
        .build();

//create step
Step mainStep = stepBuilder.get("run")
        .<ItemProcessing, ItemProcessing>chunk(5)
        .faultTolerant()
        .skip(IOException.class).skip(SocketTimeoutException.class) //skip IOException here
        .skipLimit(2000)
        .reader(reader)
        .processor(processor)
        .writer(writer)
        .listener(stepExecListener)
        .listener(new ItemProcessorListener()) //add process listener
        .listener(skipExceptionListener) //add skip exception listener
        .build();

//create job
Job job = jobBuilder.get(batchReqExec.getId())
        .start(mainStep)
        .next(resultFileStep)
        .build();

JobParametersBuilder jobParamBuilder = new JobParametersBuilder();
//run job
JobExecution execution = jobLauncher.run(job, jobParamBuilder.toJobParameters());
The read method is like below:
private FlatFileItemReader<ItemProcessing> read(BatchRequestsRun batchReqRun) throws Exception {
    //prepare input file
    File localInputDir = new File(localStoragePath + "/batch-requests");
    if (!localInputDir.exists() || localInputDir.isFile()) {
        localInputDir.mkdir();
    }
    File localFile = new File(localInputDir, batchReqRun.getFileRef() + "-" + batchReqRun.getFile());
    if (!localFile.exists()) {
        httpClientService.getFileFromStorage(batchReqRun.getFileRef(), localFile);
    }

    FlatFileItemReader<ItemProcessing> reader = new FlatFileItemReader<ItemProcessing>();
    reader.setResource(new FileSystemResource(localFile));
    reader.setLineMapper(new DefaultLineMapper<ItemProcessing>() {
        {
            setLineTokenizer(new DelimitedLineTokenizer());
            setFieldSetMapper(new FieldSetMapper<ItemProcessing>() {
                @Override
                public ItemProcessing mapFieldSet(FieldSet fieldSet) throws BindException {
                    ItemProcessing item = new ItemProcessing();
                    item.setFieldSet(fieldSet);
                    return item;
                }
            });
        }
    });
    return reader;
}
The process method is like below:
private ItemProcessor<ItemProcessing, ItemProcessing> process(BatchRequestsDef batchReqDef) {
    ItemProcessor<ItemProcessing, ItemProcessing> processor = (input) -> {
        VelocityContext context = new VelocityContext();
        //.....ignore velocity code
        String responseBody = null;
        //send http invocation
        input.setResponseBody(httpClientService.process(batchReqDef, input));
        responseBody = input.getResponseBody();
        logger.info(responseBody);

        // using Groovy to parse the response
        Binding binding = new Binding();
        try {
            binding.setVariable("response", responseBody);
            GroovyShell shell = new GroovyShell(binding);
            Object result = shell.evaluate(batchReqDef.getConfig().getResponseHandler());
            input.setResult(result.toString());
        } catch (Exception e) {
            logger.error("parse groovy script found exception:{},{}", e.getMessage(), e);
        }
        return input;
    };
    return processor;
}
I'll omit the file-writing method here.
Can anyone help me implement the process step with multiple threads?
I guess Spring Batch reads one line of data and then processes that line (executing the ItemProcessor to invoke the remote service directly).
As we know, reading one line of data is much faster than invoking the HTTP service once.
So I want to read all the data (or part of it) into memory (a List) with a single thread, and then make the remote calls with multiple threads in step 2.
(It's very easy using a Java thread pool, but I don't know how to implement it with Spring Batch.)
Please show me some code, thanks a lot!
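A minimal sketch of one way to get this effect in Spring Batch, assuming the reader, processor, writer and stepBuilder variables from the code above: wrap the existing processor in an AsyncItemProcessor backed by a TaskExecutor and pair it with an AsyncItemWriter (both from the spring-batch-integration module, package org.springframework.batch.integration.async), so the remote calls run on pool threads while reading stays single-threaded.

//process data asynchronously: each item's remote call runs on a pool thread
ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
taskExecutor.setCorePoolSize(10);
taskExecutor.setMaxPoolSize(10);
taskExecutor.afterPropertiesSet();

AsyncItemProcessor<ItemProcessing, ItemProcessing> asyncProcessor = new AsyncItemProcessor<>();
asyncProcessor.setDelegate(processor);   // the existing ItemProcessor from process(batchReqDef)
asyncProcessor.setTaskExecutor(taskExecutor);

AsyncItemWriter<ItemProcessing> asyncWriter = new AsyncItemWriter<>();
asyncWriter.setDelegate(writer);         // the existing FlatFileItemWriter

Step mainStep = stepBuilder.get("run")
        .<ItemProcessing, Future<ItemProcessing>>chunk(5)
        .reader(reader)                  // reading stays single-threaded
        .processor(asyncProcessor)       // returns a Future<ItemProcessing> per item
        .writer(asyncWriter)             // unwraps the futures before delegating to the writer
        .build();

The fault-tolerance and listener configuration from the original step can be added back on top of this; the sketch only shows the asynchronous processor/writer pair.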
Hello Spring Batch community! I have an input flat file with a header and a body. The header is 1 line (naturally) with 5 parameters. The body can reach up to 1 million records with 12 parameters each.
Input File:
01.01.2017|SUBDCOBR|12:21:23|01/12/2016|31/12/2016
01.01.2017|12345678231234|0002342434|BORGIA RUBEN|27-32548987-9|FA|A|2062-
00010443/444/445|142,12|30/08/2017|142,01
01.01.2017|12345673201234|2342434|ALVAREZ ESTHER|27-32533987-9|FA|A|2062-
00010443/444/445|142,12|30/08/2017|142,02
01.01.2017|12345673201234|0002342434|LOPEZ LUCRECIA|27-32553387-9|FA|A|2062-
00010443/444/445|142,12|30/08/2017|142,12
01.01.2017|12345672301234|0002342434|SILVA JESUS|27-32558657-9|NC|A|2062-
00010443|142,12|30/08/2017|142,12
.
.
.
I need to write this into a .txt file with a certain format, in this specific structure:
HEADER (8 custom lines, using data from the HEADER input)
TITLE OF COLUMNS (1 line)
DETAILS (17 records from the body)
line break
SAME HEADER
SAME TITLE OF COLUMNS
DETAILS (next 17 records from the body)
line break
...
...
...
REPEAT until end of file
What I did was create a stepHeader and a stepBody, each of them with their own reader, processor (business formatter) and writer.
The job will have only these 2 simple steps.
@Bean
public Job job() throws Exception {
    return jobBuilderFactory.get("job")
            .incrementer(new RunIdIncrementer())
            .listener(new JobListener())
            .start(stepHeader())
            .next(stepBody())
            .on("BACK TO STEPHEADER").to(stepHeader())
            .on("END").end().build()
            .build();
}
The header reader is configured with maxItemCount=1 and mapped to CabeceraFacturacion:
@Bean
public FlatFileItemReader<CabeceraFacturacion> readerCabecera() throws Exception {
    FlatFileItemReader<CabeceraFacturacion> reader = new FlatFileItemReader<>();
    reader.setLinesToSkip(0);
    reader.setMaxItemCount(1);
    reader.setResource(new ClassPathResource("/inputFiles/input.txt"));

    DefaultLineMapper<CabeceraFacturacion> cabeceraLineMapper = new DefaultLineMapper<>();
    DelimitedLineTokenizer tokenizer = new DelimitedLineTokenizer("|"); // by default, the comma is the separator
    tokenizer.setNames(new String[] {"printDate", "reportIdentifier", "tituloReporte", "fechaDesde", "fechaHasta"});
    cabeceraLineMapper.setLineTokenizer(tokenizer);
    cabeceraLineMapper.setFieldSetMapper(new CabeceraFieldSetMapper());
    cabeceraLineMapper.afterPropertiesSet();

    reader.setLineMapper(cabeceraLineMapper);
    return reader;
}
The body is read this way, skipping the first line, and mapped to DetalleFacturacion:
@Bean
public FlatFileItemReader<DetalleFacturacion> readerDetalleFacturacion() {
    FlatFileItemReader<DetalleFacturacion> reader = new FlatFileItemReader<>();
    reader.setLinesToSkip(1);
    //reader.setMaxItemCount(17);
    reader.setResource(new ClassPathResource("/inputFiles/input.txt"));

    DefaultLineMapper<DetalleFacturacion> detalleLineMapper = new DefaultLineMapper<>();
    DelimitedLineTokenizer tokenizerDet = new DelimitedLineTokenizer("|"); // by default, the comma is the separator
    tokenizerDet.setNames(new String[] {"fechaEmision", "tipoDocumento", "letra", "nroComprobante",
            "nroCliente", "razonSocial", "cuit", "montoNetoGP", "montoNetoG3",
            "montoExento", "impuestos", "montoTotal"});
    detalleLineMapper.setLineTokenizer(tokenizerDet);
    detalleLineMapper.setFieldSetMapper(new DetalleFieldSetMapper());
    detalleLineMapper.afterPropertiesSet();

    reader.setLineMapper(detalleLineMapper);
    return reader;
}
My Steps:
@Bean
public Step stepHeader() throws Exception {
    return stepBuilderFactory.get("stepHeader")
            .<CabeceraFacturacion, CabeceraFacturacion>chunk(17)
            .faultTolerant()
            .listener(new ChunkListener())
            .reader(readerCabecera())
            .writer(writerCabeceraFact())
            .allowStartIfComplete(true)
            .build();
}

@Bean
public Step stepBody() {
    return stepBuilderFactory.get("stepBody")
            .<DetalleFacturacion, DetalleFacturacion>chunk(17)
            .faultTolerant()
            .listener(new ChunkListener())
            .reader(readerDetalleFacturacion())
            .writer(writerDetalleFact())
            .listener(new StepExecutionListener() {
                @Override
                public ExitStatus afterStep(StepExecution stepExecution) {
                    if (stepExecution.getWriteCount() == 17) {
                        return new ExitStatus("BACK TO STEPHEADER");
                    }
                    // if (stepExecution.getReadCount() < 17) {
                    //     return new ExitStatus("END");
                    // }
                    return null;
                }

                @Override
                public void beforeStep(StepExecution stepExecution) {
                }
            })
            .allowStartIfComplete(true)
            .build();
}
1) I don't know how to achieve going back to stepHeader indefinitely until the file ends. There I tried using stepExecution.getWriteCount() == 17, but I'm not sure this is the way.
2) I don't know how to read 17 different records every time it loops (I managed to make it loop, but it would write the same first 17 records over and over again until I manually stopped the job). I now know that loops are not recommended in Spring Batch processes.
3) If anyone has any idea of another way to achieve my goal, it will be most welcome.
4) Is there a way to make a decider that's "listening" all the time, and sends the order to print the header or the body if a certain condition is satisfied?
Up until now, the most I have achieved is to read & write the header only once... and in the next step read & write 17 lines of the body.
Thank you everyone!
Cheers!!
Not sure if I understood your question correctly, but this is what you want to achieve:
Step 1: Read the header from the file
Step 2: Read the file, process the data and write it to some file, until some condition A
Step 3: On condition A, go back to Step 1
There can be multiple options to configure this. The one I can think of is adding an additional step for the flow decision. Below is a sample configuration.
Note I have not tested this; you might have to do some modifications.
@Bean
public Job conditionalJob(JobBuilderFactory jobs, Step flowDeciderStep, Step step1, Step step2) throws Exception {
    return jobs.get("conditionalJob")
            .incrementer(new RunIdIncrementer())
            .flow(flowDeciderStep).on("HEADER").to(step1).next(flowDeciderStep)
            .from(flowDeciderStep).on("BODY").to(step2).next(flowDeciderStep)
            .from(flowDeciderStep).on("*").stop()
            .end()
            .build();
}

public class FlowDecider implements Tasklet {

    private Logger logger = LoggerFactory.getLogger(this.getClass());

    @Override
    public RepeatStatus execute(StepContribution contribution,
                                ChunkContext chunkContext) throws Exception {
        logger.info("flowDecider");
        // put your flow logic here; you can use the step execution (and its
        // execution context) to pass information from one step to another, e.g.:
        // if (condition1) contribution.setExitStatus(new ExitStatus("HEADER"));
        // else if (condition2) contribution.setExitStatus(new ExitStatus("BODY"));
        // else contribution.setExitStatus(ExitStatus.COMPLETED);
        return RepeatStatus.FINISHED;
    }
}
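Alternatively, the same decision logic can be expressed as a JobExecutionDecider rather than a Tasklet step, so the flow status is returned directly; a rough, untested sketch (the moreRecordsRemain() helper is a placeholder, not real logic):

public class HeaderOrBodyDecider implements JobExecutionDecider {

    @Override
    public FlowExecutionStatus decide(JobExecution jobExecution, StepExecution stepExecution) {
        // stepExecution is the last executed step; inspect it (write count, values kept
        // in its ExecutionContext, ...) to decide where the flow should go next
        if (moreRecordsRemain(stepExecution)) {
            return new FlowExecutionStatus("HEADER"); // loop back and print the next header block
        }
        return FlowExecutionStatus.COMPLETED;
    }

    // placeholder: real logic would compare a counter kept in the execution context
    // against the number of records still left in the file
    private boolean moreRecordsRemain(StepExecution stepExecution) {
        return false;
    }
}

It would be wired into the job with the same kind of .on("HEADER").to(step1) transitions shown above.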
Using Spring Integration Java DSL, I have constructed a flow where I'm processing files synchronously with a FileSplitter. I've been able to use the setDeleteFiles flag on an AbstractFilePayloadTransformer to delete the file after converting each line in the File to a Message for subsequent processing, like so:
@Bean
protected IntegrationFlow s3ChannelFlow() {
    // do not exhaust filesystem w/ files downloaded from S3
    FileToInputStreamTransformer transformer = new FileToInputStreamTransformer();
    transformer.setDeleteFiles(true);

    // @see http://docs.spring.io/spring-integration/reference/html/files.html#file-reading
    // @formatter:off
    return IntegrationFlows
            .from(s3Channel())
            .channel(StatsUtil.createRunStatsChannel(runStatsRepository))
            .transform(transformer)
            .split(new FileSplitter())
            .transform(new JsonToObjectViaTypeHeaderTransformer(new Jackson2JsonObjectMapper(objectMapper), typeSupport))
            .publishSubscribeChannel(p -> p.subscribe(persistenceSubFlow()))
            .get();
    // @formatter:on
}
This works fine, but is slow. So I attempted to add an ExecutorChannel after the .split above, like so:
.channel(c -> c.executor(Executors.newFixedThreadPool(10)))
But then the aforementioned delete flag means files can be deleted before they are completely read, so the flow no longer completes successfully.
If I remove the flag, I have the potential to exhaust the local file system where files are synchronized from S3.
What could I introduce above to a) process each file completely and b) delete the file from the local filesystem once done? In other words, is there a way to know exactly when a file is completely processed (when its lines have been processed asynchronously via threads in a pool)?
If you're curious here's my impl of FileToInputStreamTransformer:
public class FileToInputStreamTransformer extends AbstractFilePayloadTransformer<InputStream> {

    private static final int BUFFER_SIZE = 64 * 1024; // 64 kB

    @Override
    // @see http://java-performance.info/java-io-bufferedinputstream-and-java-util-zip-gzipinputstream/
    protected InputStream transformFile(File payload) throws Exception {
        return new GZIPInputStream(new FileInputStream(payload), BUFFER_SIZE);
    }
}
UPDATE
So how does something in the downstream flow know what to ask for?
Incidentally, if I'm following your advice correctly, when I update the .split with new FileSplitter(true, true) above, I get:
2015-10-20 14:26:45,288 [pool-6-thread-1] org.springframework.integration.handler.LoggingHandler ERROR org.springframework.integration.transformer.MessageTransformationException: failed to transform message; nested exception is java.lang.IllegalArgumentException: 'json' argument must be an instance of: [class java.lang.String, class [B, class java.io.File, class java.net.URL, class java.io.InputStream, class java.io.Reader] , but gotten: class org.springframework.integration.file.splitter.FileSplitter$FileMarker
at org.springframework.integration.transformer.AbstractTransformer.transform(AbstractTransformer.java:44)
The FileSplitter has a markers option exactly for this purpose:
Set to true to emit start/end of file marker messages before and after the file data. Markers are messages with FileSplitter.FileMarker payloads (with START and END values in the mark property). Markers might be used when sequentially processing files in a downstream flow where some lines are filtered. They enable the downstream processing to know when a file has been completely processed. The END marker includes a line count. Default: false. When true, apply-sequence is false by default.
You can use it in the downstream flow to determine whether the file can already be removed or not; for example:
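A minimal sketch of that idea, with markers enabled via new FileSplitter(true, true) as mentioned above (the afterSplitChannel name and the separate cleanup flow are illustrative, not part of the original flow): only END markers pass the filter, and the END marker carries the path of the file whose lines have all been emitted.

@Bean
public IntegrationFlow endMarkerCleanup() {
    return IntegrationFlows
            .from("afterSplitChannel")
            // lines and START markers are handled elsewhere; only END markers pass
            .filter(FileSplitter.FileMarker.class,
                    marker -> marker.getMark() == FileSplitter.FileMarker.Mark.END)
            .handle(FileSplitter.FileMarker.class, (marker, headers) -> {
                // the END marker is emitted only after every line of the file has been read
                new File(marker.getFilePath()).delete();
                return null; // nothing to send downstream
            })
            .get();
}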
Thanks Artem.
I did manage to address the issue, but perhaps in a more heavy-weight manner.
Introducing an ExecutorChannel caused quite a ripple of implementation adjustments, including: turning off the setDeleteFiles flag on the AbstractFilePayloadTransformer, and updating a JPA @Entity, RunStats, and its repository to capture file processing status as well as the processing status for an entire run. Taken together, the processing status updates let the flow know when to delete files from the local filesystem (i.e., when they're fully processed) and allow a /stats/{run} endpoint to return a status so a user can know when a run is completed.
Here are snippets from my implementation (if anyone's curious)...
class FileToInputStreamTransformer extends AbstractFilePayloadTransformer<InputStream> {

    private static final int BUFFER_SIZE = 64 * 1024; // 64 kB

    @Override
    // @see http://java-performance.info/java-io-bufferedinputstream-and-java-util-zip-gzipinputstream/
    protected InputStream transformFile(File payload) throws Exception {
        return new GZIPInputStream(new FileInputStream(payload), BUFFER_SIZE);
    }
}
public class RunStatsHandler extends AbstractMessageHandler {

    private final SplunkSlf4jLogger log = new SplunkSlf4jLogger(LoggerFactory.getLogger(getClass()));

    private static final int BUFFER_SIZE = 64 * 1024; // 64 kB

    private final RunStatsRepository runStatsRepository;

    public RunStatsHandler(RunStatsRepository runStatsRepository) {
        this.runStatsRepository = runStatsRepository;
    }

    // Memory efficient routine, @see http://www.baeldung.com/java-read-lines-large-file
    @Override
    protected void handleMessageInternal(Message<?> message) throws Exception {
        RunStats runStats = message.getHeaders().get(RunStats.RUN, RunStats.class);
        String token = message.getHeaders().get(RunStats.FILE_TOKEN, String.class);
        if (runStats != null) {
            File compressedFile = (File) message.getPayload();
            String compressedFileName = compressedFile.getCanonicalPath();
            LongAdder lineCount = new LongAdder();
            // Streams and Scanner implement java.lang.AutoCloseable
            InputStream fs = new FileInputStream(compressedFile);
            InputStream gzfs = new GZIPInputStream(fs, BUFFER_SIZE);
            try (Scanner sc = new Scanner(gzfs, "UTF-8")) {
                while (sc.hasNextLine()) {
                    sc.nextLine();
                    lineCount.increment();
                }
                // note that Scanner suppresses exceptions
                if (sc.ioException() != null) {
                    log.warn("file.lineCount", ImmutableMap.of("run", runStats.getRun(), "file", compressedFileName,
                            "exception", sc.ioException().getMessage()));
                    throw sc.ioException();
                }
                runStats.addFile(compressedFileName, token, lineCount.longValue());
                runStatsRepository.updateRunStats(runStats);
                log.info("file.lineCount",
                        ImmutableMap.of("run", runStats.getRun(), "file", compressedFileName, "lineCount", lineCount.intValue()));
            }
        }
    }
}
Updated flow
@Bean
protected IntegrationFlow s3ChannelFlow() {
    // @see http://docs.spring.io/spring-integration/reference/html/files.html#file-reading
    // @formatter:off
    return IntegrationFlows
            .from(s3Channel())
            .enrichHeaders(h -> h.headerFunction(RunStats.FILE_TOKEN, f -> UUID.randomUUID().toString()))
            .channel(runStatsChannel())
            .channel(c -> c.executor(Executors.newFixedThreadPool(persistencePoolSize)))
            .transform(new FileToInputStreamTransformer())
            .split(new FileSplitter())
            .transform(new JsonToObjectViaTypeHeaderTransformer(new Jackson2JsonObjectMapper(objectMapper), typeSupport))
            .publishSubscribeChannel(p -> p.subscribe(persistenceSubFlow()))
            .get();
    // @formatter:on
}

@Bean
public MessageChannel runStatsChannel() {
    DirectChannel wiretapChannel = new DirectChannel();
    wiretapChannel.subscribe(new RunStatsHandler(runStatsRepository));
    DirectChannel loggingChannel = new DirectChannel();
    loggingChannel.addInterceptor(new WireTap(wiretapChannel));
    return loggingChannel;
}
Unfortunately, I can't share the RunStats and repo implementations.