How to read multiple files in chunks using Spring-Batch?

I'm using Spring-Batch to read csv files sequentially with MultiResourceItemReader.
I want to create a reader that
reads the chunksize from file 1
reads the chunksize from file 2
compares what has been read from both files and creates some kind of "patch" object
writes the patch object to the database
Now the problem with MultiResourceItemReader is that it will first read the full file1 in chunks, and when the file is finished, it will continue with file2.
How can I create batch steps that will switch between the files based on the chunksize?

You're going to need to create a custom reader to address what you're attempting. You can use the FlatFileItemReader under the hood for the actual file reading, but the logic of reading from two files at once you'll have to orchestrate yourself. Just coding off the top of my head, I'd expect something like this:
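import java.util.List;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemStreamReader;

// SomeObject is a placeholder type that collects one item per file.
// The delegates also have to be opened and closed, e.g. by registering them as streams on the step.
public class MultiFileReader implements ItemReader<SomeObject> {
    private List<ItemStreamReader<?>> readers;
    @Override
    public SomeObject read() throws Exception {
        SomeObject domainObject = new SomeObject();
        boolean anyRead = false;
        for (ItemStreamReader<?> curReader : readers) {
            Object item = curReader.read();
            anyRead |= item != null;
            domainObject.add(item);
        }
        // Returning null tells Spring Batch that the input is exhausted.
        return anyRead ? domainObject : null;
    }
}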
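As a rough, framework-free sketch of the idea above: one logical read pulls the next line from each file and returns both together, so a chunk of size N covers N lines from each file. `PairedLineReader` and `LinePair` are hypothetical names, not Spring Batch classes. In a real step the two `BufferedReader`s would presumably be replaced by two `FlatFileItemReader` delegates, and `read()` would be the `ItemReader.read()` method.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical holder for one line from each file; either side may be null. */
final class LinePair {
    final String left;
    final String right;

    LinePair(String left, String right) {
        this.left = left;
        this.right = right;
    }
}

/** Hypothetical reader that advances two sources in lockstep. */
final class PairedLineReader {
    private final BufferedReader first;
    private final BufferedReader second;

    PairedLineReader(BufferedReader first, BufferedReader second) {
        this.first = first;
        this.second = second;
    }

    /** Returns the next pair, or null once both sources are exhausted. */
    LinePair read() {
        try {
            String left = first.readLine();
            String right = second.readLine();
            // null signals end of input, mirroring the ItemReader contract.
            if (left == null && right == null) {
                return null;
            }
            return new LinePair(left, right);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a chunk size of 10, the step would then receive 10 pairs at a time, and the comparison into a "patch" object could happen in the processor.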

You could use something like:
@Bean
public MultiResourceItemReader<Company> readerCompany() throws IOException {
DelimitedLineTokenizer dlt = new DelimitedLineTokenizer();
dlt.setDelimiter("^");
dlt.setNames("name", "cui", "code", "euid", "companyState", "address");
dlt.setStrict(false);
return new MultiResourceItemReaderBuilder<Company>()
.name("readerCompany")
.resources(inputCompanyResources)
.delegate(new FlatFileItemReaderBuilder<Company>()
.name("getCompanyStatusReader")
.fieldSetMapper(new FieldSetMapper<Company>() {
@Override
public Company mapFieldSet(FieldSet fieldSet) throws BindException {
return Company.builder()
.name(fieldSet.readString("name"))
.localId(fieldSet.readString("cui"))
.code(fieldSet.readString("code"))
.companyStatus(readCompanyStatuses(fieldSet.readString("companyState")))
.address(fieldSet.readString("address"))
.internationalId(fieldSet.readString("euid"))
.build();
}
})
.linesToSkip(1)
.lineTokenizer(dlt)
.build())
.build();
}

Related

Spring Batch Reader Writer transfer of data not working as expected

I have created a generic Spring Batch job that processes data and stores it in a CSV file. I need some data from the reader passed to the writer, which I am trying to do using the JobExecution. Surprisingly, however, the code seems to call getWriter() before getReader().
My config is given below. Could someone explain why this happens and whether there is an alternative way to pass data from the reader to the writer?
@Bean
@StepScope
public ItemReader<Map<String, Object>> getDataReader() throws Exception {
return springBatchReader.getReader();
}
@Bean
@StepScope
public FlatFileItemWriter<Map<String, Object>> getDataWriter() throws Exception {
return (FlatFileItemWriter<Map<String, Object>>) springBatchWriter.getWriter();
}
@Bean
public Job SpringBatchJob(Step generateReport) throws Exception {
return jobBuilderFactory.get("SpringBatchJob" + System.currentTimeMillis())
.preventRestart()
.incrementer(new RunIdIncrementer())
.flow(generateReport)
.end()
.build();
}
@Bean
public Step generateReport() throws Exception {
return stepBuilderFactory.get("generateReport").<Map<String, Object>, Map<String, Object>>chunk(batchSize)
.reader(getDataReader()).writer(getDataWriter()).build();
}
The data I want to pass from the reader to the writer is the column names for the CSV. My reader runs variable SQL queries (the query to run is passed as a command-line argument), so the result set and its columns are not static and vary with the given query. The rationale for sending data from the reader to the writer was to give the writer, via setHeaderCallback, the column names to write for that particular execution.
The reader simply runs the given query and puts the data into a Map<String, Object> rather than a POJO, because of the variable nature of the data. The keys of the Map are the column names and the corresponding objects hold the values for those columns. So essentially I want the writer's setHeaderCallback to be able to access the keys of the passed Map, or to pass the keys from the reader to the writer somehow.
The Writer Code is as follows:
public FlatFileItemWriter<Map<String, Object>> getWriter() throws Exception {
String reportName = getReportName();
saveToContext(reportName, reportPath);
FileSystemResource resource = new FileSystemResource(String.join(File.separator, reportPath, getReportName()));
FlatFileItemWriter<Map<String, Object>> flatFileItemWriter = new FlatFileItemWriter<>();
flatFileItemWriter.setResource(resource);
//NEED HELP HERE..HOW TO SET THE HEADER TO BE THE KEYS OF THE MAP
//flatFileItemWriter.setHeaderCallback();
flatFileItemWriter.setLineAggregator(new DelimitedLineAggregator<Map<String, Object>>() {
{
setDelimiter(delimiter);
setFieldExtractor(
new PassThroughFieldExtractor<>()
);
}
});
flatFileItemWriter.afterPropertiesSet();
return flatFileItemWriter;
}

The execution order of those methods does not matter. You should not be looking for a way to pass data from the reader to the writer using the execution context; the chunk-oriented tasklet implementation provided by Spring Batch will do that for you.
The execution context could be used to pass data from one step to another, but not from the reader to the writer within the same step.
EDIT: update answer based on comments:
Your issue is that you are calling saveToContext(reportName, reportPath); in the getWriter method. This method is called at configuration time and not at runtime.
What you really need is to provide the column names either via job parameters or by putting them in the execution context in an earlier step, and then use a step-scoped header callback that is configured with those headers.
You can find an example here: https://stackoverflow.com/a/56719077/5019386. This example is for the lineMapper but you can do the same for the headerCallback. If you don't want to use the job parameters approach, you can create a tasklet step that determines column names and puts them in the execution context, then configure the step-scoped header callback with those names from the execution context, something like:
@Bean
@StepScope
public FlatFileHeaderCallback headerCallback(@Value("#{jobExecutionContext['columnNames']}") String columnNames) {
return new FlatFileHeaderCallback() {
@Override
public void writeHeader(Writer writer) throws IOException {
// use columnNames here
}
};
}
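For the question's case, where each item is a `Map<String, Object>` keyed by column name, the header line is just the keys joined with the delimiter. Below is a minimal sketch of what the body of `writeHeader` might do, assuming the column names arrive in a stable order (for example from a `LinkedHashMap`). `HeaderLines` is a hypothetical helper, not part of Spring Batch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Collection;

/** Hypothetical helper for building a CSV header from column names. */
final class HeaderLines {
    private HeaderLines() {
    }

    /** Joins the column names in iteration order, e.g. "id,name,amount". */
    static String headerFor(Collection<String> columnNames, String delimiter) {
        return String.join(delimiter, columnNames);
    }

    /** Writes the header text (without a line terminator) to the given writer. */
    static void writeHeader(Writer writer, Collection<String> columnNames, String delimiter) {
        try {
            writer.write(headerFor(columnNames, delimiter));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If the column names are stored in the execution context as a single delimited string, as in the snippet above, that string could be written to the `Writer` directly instead.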

Master/slave Job architectural design in spring batch using modular job approach

I hope you're doing great.
I'm facing a design problem in Spring Batch.
Let me explain:
I have a modular Spring Batch job architecture;
each job has its own config file and context.
I am designing a master job to launch the subjobs (50+ types of subjobs).
An X obj has, among other fields, a name, a state and a blob that contains the csv file attached to it.
The X obj is updated after being processed.
I first followed the approach of fetching all X objs and then looping (in a Java stream) to call the appropriate job.
But this approach has a lot of limitations.
So I designed a masterJob with a reader, a processor and a writer.
The masterJob should read an X obj, call the appropriate subJob and then update the state of the X obj.
The masterJobReader calls a custom service to get a list of, let's say, X objs.
I started by trying to launch the subjob from within the masterJob processor, but it did not work.
I did some research and found that JobStep could be more suitable for this scenario.
But I'm stuck on how to pass the item read by masterJobReader to the JobStep as a parameter.
I saw DefaultJobParametersExtractor and tried to set the item read into the stepExecutionContext, but it's not working.
My question: how do I pass parameters from the MasterJob to the SubJob using the
JobStep approach?
If there is a better way to deal with this, then I'm all yours!
I'm using Java Config and Spring Batch 4.3.
Edit to provide sample code:
@Configuration
public class MasterJob {
@Value("${defaultCompletionPolicy}")
private Integer defaultCompletionPolicy;
@Autowired
protected StepBuilderFactory masterStepBuilderFactory;
private Logger logger = LoggerFactory.getLogger(MasterJob.class);
@Autowired
protected JobRepository jobRepo;
@Autowired
protected PlatformTransactionManager transactionManager;
@Autowired
@Qualifier("JOB_NAME1")
private Job JOB_NAME1; // this should change to be dynamic as there are around 50 types of job
@Bean(name = "masterJob")
protected Job masterBatchJob() throws ApiException {
return new JobBuilderFactory(jobRepo).get("masterJob")
.incrementer(new RunIdIncrementer())
.start(masterJobStep(masterJobReader(), masterJobWriter()))
.next(jobStepJobStep1(null))
.next(masterUpdateStep()) // update the state of objX
.build();
}
@Bean(name = "masterJobStep")
protected Step masterJobStep(@Qualifier("masterJobReader") MasterJobReader masterReader,
@Qualifier("masterJobWriter") MasterJobWriter masterWriter) throws ApiException {
logger.debug("inside masterJobStep");
return this.masterStepBuilderFactory.get("masterJobStep")
.<Customer, Customer>chunk(defaultCompletionPolicy)
.reader(masterJobReader())
.processor(masterJobProcessor())
.writer(masterJobWriter())
.transactionManager(transactionManager)
.listener(new MasterJobWriter()) // I set the parameter inside this.
.listener(masterPromotionListener())
.build();
}
@Bean(name = "masterJobWriter", destroyMethod = "")
@StepScope
protected MasterJobWriter masterJobWriter() {
return new MasterJobWriter();
}
@Bean(name = "masterJobReader", destroyMethod = "")
@StepScope
protected MasterJobReader masterJobReader() throws ApiException {
return new MasterJobReader();
}
protected FieldSetMapper<Customer> mapper() {
return new CustomerMapper();
}
@Bean(name="masterPromotionListener")
public ExecutionContextPromotionListener masterPromotionListener() {
ExecutionContextPromotionListener listener = new ExecutionContextPromotionListener();
listener.setKeys(
new String[]
{
"inputFile",
"outputFile",
"customerId",
"comments",
"customer"
});
//listener.setStrict(true);
return listener;
}
@Bean(name = "masterUpdateStep")
public Step masterUpdateStep() {
return this.masterStepBuilderFactory.get("masterCleanStep").tasklet(new MasterUpdateTasklet()).build();
}
@Bean(name = "masterJobProcessor", destroyMethod = "")
@StepScope
protected MasterJobProcessor masterJobProcessor() {
return new MasterJobProcessor();
}
@Bean
public Step jobStepJobStep1(JobLauncher jobLauncher) {
return this.masterStepBuilderFactory.get("jobStepJobStep1")
.job(JOB_NAME1)
.launcher(jobLauncher)
.parametersExtractor(jobParametersExtractor())
.build();
}
@Bean
public DefaultJobParametersExtractor jobParametersExtractor() {
DefaultJobParametersExtractor extractor = new DefaultJobParametersExtractor();
extractor.setKeys(
new String[] { "inputFile", "outputFile", "customerId", "comments", "customer" });
return extractor;
}
}
This is how I set parameter from within the MasterJobWriter:
String inputFile = fetchInputFile(customer);
String outputFile = buildOutputFileName(customer);
Comments comments = "comments"; // from business logic
ExecutionContext stepContext = this.stepExecution.getExecutionContext();
stepContext.put("inputFile", inputFile);
stepContext.put("outputFile", outputFile);
stepContext.put("customerId", customer.getCustomerId());
stepContext.put("comments", new CustomJobParameter<Comments>(comments));
stepContext.put("customer", new CustomJobParameter<Customer>(customer));
I follow this section of the documentation of spring batch

My question how to pass parameter from MasterJob to SubJob using JobStep approach?
The JobParametersExtractor is what you are looking for. It allows you to extract parameters from the main job and pass them to the subjob. You can find an example here.
EDIT: Adding suggestions based on comments
I have a list of X obj in the DB. X obj has among other fields, id, type(of work), name, state and blob which contains the csv file attached to it. The blob field containing the csv file depends on the type field so it's not one pattern csv file. I need to process each X obj and save the content of the csv file in the DB and generate a csv result file containing the original data plus a comment field in the result csv file and update X obj state with the result csv field attached to X obj and other fields.
As you can see, the process is already complex for a single X object, so trying to process all X objects in the same job of jobs is too complex IMHO. So much complexity in software comes from trying to make one thing do two things.
If there is better way to deal with this then I'm all yours!
Since you are open for suggestions, I will recommend two options:
Option 1:
If it were up to me, I would create a job instance per X obj. This way, I can 1) parallelize things and 2) in case of failure, restart only the failed job. These two characteristics (Scalability and Restartability) are almost impossible with the job of jobs approach. Even if you have a lot of X objects, this is not a problem. You can use one of the scaling techniques provided by Spring Batch to process things in parallel.
Option 2:
If you really can't or don't want to use different job instances, you can use a single job with a chunk-oriented step that iterates over X objects list. The processing logic seems independent from one record to another, so this step should be easily scalable with multiple threads.
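The two properties mentioned under Option 1, running independent units in parallel and retrying only the ones that failed, can be illustrated outside Spring Batch with a plain thread pool. This is only a sketch of the idea: `ParallelRunner` is a hypothetical name, and in Spring Batch the equivalent would be separate job instances (or a multi-threaded step), with the framework tracking which executions failed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

/** Hypothetical runner: processes each item independently and reports failures. */
final class ParallelRunner {
    private ParallelRunner() {
    }

    /** Runs the task for every item on a fixed pool; returns the items whose task threw. */
    static <T> List<T> runAll(List<T> items, Consumer<T> task, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (T item : items) {
                futures.add(pool.submit(() -> task.accept(item)));
            }
            List<T> failed = new ArrayList<>();
            for (int i = 0; i < items.size(); i++) {
                try {
                    futures.get(i).get(); // wait for this item's task to finish
                } catch (ExecutionException e) {
                    failed.add(items.get(i)); // only this item needs a retry
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    failed.add(items.get(i));
                }
            }
            return failed;
        } finally {
            pool.shutdown();
        }
    }
}
```

A failure in one item leaves the others untouched, which is the property that is hard to get when everything runs inside a single job of jobs.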

Spring Batch - FlatFileItemWriter write to a file, close it, write to new file

So I have a custom class that extends FlatFileItemWriter that I am using to write a CSV file. Based on some condition I want to finish with my first file and then write to a new file. Is there a way to do this with FlatFileItemWriter?
Here is an example of my code:
@Override
public void write(List<? extends MyObject> items) throws Exception {
ExecutionContext stepContext = this.stepExecution.getExecutionContext();
int currentFileIncrement = (int) stepContext.get("currentFileIncrement");
if (currentFileIncrement == fileIncrement) {
super.write(items);
}
else {
super.close();
fileIncrement = currentFileIncrement;
super.setHeaderCallback(determineHeaderCallback());
super.setResource(new FileSystemResource("src/main/resources/" + fileIncrement+ ".csv"));
super.setShouldDeleteIfExists(true);
DelimitedLineAggregator<MyObject> delLineAgg = new DelimitedLineAggregator<>();
delLineAgg.setDelimiter(",");
BeanWrapperFieldExtractor<MyObject> fieldExtractor = new BeanWrapperFieldExtractor<>();
fieldExtractor.setNames(new String[] {"id", "amount"});
delLineAgg.setFieldExtractor(fieldExtractor);
super.setLineAggregator(delLineAgg);
super.write(items);
}
}

I don't understand whether you are using a custom writer or the Spring one. If you're using a custom one (maybe extending the Spring one), you could do whatever you want by passing parameters through the processor, reader or mapper. If you want to use the Spring writer, you should use separate steps.
Please give more details.
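To make the custom-writer route more concrete, here is a framework-free sketch of the rollover itself: when the increment changes, the current output is closed and a new one is opened before writing continues. `RollingLineWriter` and the `Function<Integer, Writer>` factory are hypothetical; in the question's setup the factory would open something like `<increment>.csv`, and a header could be written right after each new output is opened.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.List;
import java.util.function.Function;

/** Hypothetical writer that switches to a new output whenever the increment changes. */
final class RollingLineWriter implements AutoCloseable {
    private final Function<Integer, Writer> outputFactory;
    private Writer current;
    private int currentIncrement = -1;

    RollingLineWriter(Function<Integer, Writer> outputFactory) {
        this.outputFactory = outputFactory;
    }

    /** Writes the lines to the output for the given increment, rolling over if needed. */
    void write(int increment, List<String> lines) {
        try {
            if (current == null || increment != currentIncrement) {
                close(); // finish the previous output, if any
                current = outputFactory.apply(increment);
                currentIncrement = increment;
            }
            for (String line : lines) {
                current.write(line);
                current.write(System.lineSeparator());
            }
            current.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        try {
            if (current != null) {
                current.close();
                current = null;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

One likely gap in the question's code, under the assumption that it extends `FlatFileItemWriter` directly: after `close()` and `setResource(...)`, the writer probably needs to be opened again with `open(executionContext)` before `write(...)` is called, since it expects to be open when writing.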

How to read complete line without any mapper with spring batch

I am learning Spring Batch and have a basic idea of the reader, mapper and writer. My current requirement is to read each line from a file, compress the data and write it to a gigaspace-based in-memory grid.
I understand that a line mapper is a compulsory attribute of the item reader. I have no use for a mapper; all I need to do is read the line and send it to the writer to write to the grid. So how can I skip the line mapper, or how can I read the plain line? Currently I have done something like this, which does not seem to be an ideal solution.
public class ShrFileMapper implements FieldSetMapper<SpaceDocument> {
@Override
public SpaceDocument mapFieldSet(FieldSet fieldSet) throws BindException {
String positionId = fieldSet.readString(0);
StringBuffer line = new StringBuffer();
for (String fieldValue : fieldSet.getValues()) {
line.append("\t").append(fieldValue);
}
// logic for compression
SpaceDocument spaceDocument = new SpaceDocument("shr.doc");
spaceDocument.setProperty("id", positionId);
spaceDocument.setProperty("payload", compressedString);
return spaceDocument;
}
}

Assuming you are using a FlatFileItemReader, you need to provide a resource and a LineMapper. As you do not want to turn the line of input into anything else, you do not need a LineTokenizer; you just want to pass the raw input through. For more information, check out the official documentation:
http://docs.spring.io/spring-batch/reference/html/readersAndWriters.html#flatFileItemReader
Spring already provides this functionality.
Please check out the PassThroughLineMapper: https://github.com/spring-projects/spring-batch/blob/master/spring-batch-infrastructure/src/main/java/org/springframework/batch/item/file/mapping/PassThroughLineMapper.java
public class PassThroughLineMapper implements LineMapper<String>{
@Override
public String mapLine(String line, int lineNumber) throws Exception {
return line;
}
}
This class does exactly what you need!

How to read all lines in ItemReader and return lines and file address to ItemProcessor?

I defined a job flow in my batch Spring project and defined ItemReader, ItemProcessor, ItemWriter, etc.
My ItemReader is shown below:
@Component
@StepScope
public class MyFileReader extends FlatFileItemReader<FileInfo> {
private String fileName;
public MyFileReader () {
}
@Value("#{jobParameters[fileName]}")
public void setFileName(final String fileName) {
this.fileName = fileName;
}
@Override
public void afterPropertiesSet() throws Exception {
Resource resource = new FileSystemResource(fileName);
setResource(resource);
setEncoding("UTF-8");
super.afterPropertiesSet();
}
}
and my file input format is:
111111,11111,111,111
222222,22222,222,222
I want to read all lines of the file and return the lines and the file address to the ItemProcessor, but FlatFileItemReader reads line by line. How do I do this correctly? Is overriding the doRead method and handling the problem manually the right approach?

If I'm understanding the question, you want to read in all lines from a file, store that data in an object and then pass said object to the processor. One approach would be to read all lines from the file before the job starts using a Job Listener. As illustrated below, you could read all lines in, populate a Java object that represents the content of a single row, collect all of those objects (so if there were two rows you'd populate 2 beans), and then pass them to the processor one at a time (or potentially at the same time, if you wish). It would look something like this:
First you would create a listener.
public class MyJobListenerImpl implements JobExecutionListener {
    private MyFileReader reader;

    @Override
    public void beforeJob(JobExecution jobExecution) {
        // Load the whole file before the job starts.
        reader.init();
    }

    @Override
    public void afterJob(JobExecution jobExecution) {
        // noop
    }

    // Injected
    public void setReader(MyFileReader reader) {
        this.reader = reader;
    }
}
Next add an init method to your custom reader.
public void init() {
if(Files.exists(inputFileLocation)) {
List<String> inputData = null;
try {
inputData = Files.readAllLines(inputFileLocation, StandardCharsets.UTF_8);
} catch(IOException e) {
LOGGER.error("Issue reading input file {}. Error message: {}", inputFileLocation, e.getMessage());
throw new IllegalStateException("could not read the input file.", e);
}
try {
for(String fileItem : inputData) {
YourFileDataBean yourFileDataBean = new YourFileDataBean();
String[] fields = fileItem.split(",");
yourFileDataBean.setField1(fields[0].trim());
yourFileDataBean.setField2(fields[1].trim());
yourFileDataBean.setField3(fields[2].trim());
yourFileDataBean.setField4(fields[3].trim());
myDeque.add(yourFileDataBean); // really up to you how you want to store your bean but you could add a Deque instance variable and store it there.
}
} catch(ArrayIndexOutOfBoundsException e) {
LOGGER.warn("ArrayIndexOutOfBoundsException due to data in input file.");
throw new IllegalStateException("Failure caused by init() method. Error reading in input file.");
}
} else {
LOGGER.warn("Input file {} does not exist.", inputFileLocation);
throw new IllegalStateException("Input file does not exist at file location " + inputFileLocation);
}
}
Make the read() method in your custom reader return the object populated from the file lines read in. In this example I am implementing ItemReader rather than extending FlatFileItemReader as you have done, but you get the idea. And if you intend to return a single Java object that represents the entire file, then there would be no need to store the objects in a Deque or List.
@Override
public YourFileDataBean read() throws NonTransientResourceException {
return myDeque.poll();
}
Hope this helps.
As for returning the file address to the ItemProcessor, you could make this a field in YourFileDataBean and store inputFileLocation there, or save it to the execution context and access it that way. If you inject this file path into your reader, you could do the same in your processor, assuming your reader plays no role in determining the file path (i.e., it's predetermined).
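If the goal is a single item carrying both the file address and all of its lines, the reader can return that object once and then signal the end of input. Below is a minimal sketch, assuming plain Java I/O rather than `FlatFileItemReader`. `FileContent` and `WholeFileReader` are hypothetical names, and in a step `read()` would be the `ItemReader<FileContent>` method.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical item: the file's location together with all of its lines. */
final class FileContent {
    final Path path;
    final List<String> lines;

    FileContent(Path path, List<String> lines) {
        this.path = path;
        this.lines = lines;
    }
}

/** Hypothetical reader that emits the whole file as one item, then null. */
final class WholeFileReader {
    private final Path path;
    private boolean consumed;

    WholeFileReader(Path path) {
        this.path = path;
    }

    FileContent read() {
        if (consumed) {
            return null; // end of input: the single item was already returned
        }
        consumed = true;
        try {
            return new FileContent(path, Files.readAllLines(path, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The processor then receives the path and the lines in one object, so nothing needs to go through the execution context for this case.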
