Can I use FlatFileItemReader with TaskExecutor? - java

Can I use FlatFileItemReader with a TaskExecutor in Spring Batch?
I have implemented FlatFileItemReader with a ThreadPoolTaskExecutor. When I print the records in the ItemProcessor, I do not get consistent results: not all the records are printed, and sometimes one of the records is printed more than once. This leads me to conclude that FlatFileItemReader is not thread-safe, and the Spring docs say the same, but I have seen some blogs saying that it is possible to use FlatFileItemReader with a TaskExecutor.
So my question is: is it possible to use FlatFileItemReader with a TaskExecutor in any way?
@Bean
@StepScope
public FlatFileItemReader<DataLifeCycleEvent> csvFileReader(
        @Value("#{stepExecution}") StepExecution stepExecution) {
    Resource inputResource;
    FlatFileItemReader<DataLifeCycleEvent> itemReader = new FlatFileItemReader<>();
    itemReader.setLineMapper(new OnboardingLineMapper(stepExecution));
    itemReader.setLinesToSkip(1);
    itemReader.setSaveState(false);
    itemReader.setSkippedLinesCallback(new OnboardingHeaderMapper(stepExecution));
    String inputResourceString = stepExecution.getJobParameters().getString("inputResource");
    inputResource = new FileSystemResource(inputFileLocation + ApplicationConstant.SLASH + inputResourceString);
    itemReader.setResource(inputResource);
    stepExecution.getJobExecution().getExecutionContext().putInt(ApplicationConstant.ERROR_COUNT, 0);
    return itemReader;
}

FlatFileItemReader extends AbstractItemCountingItemStreamItemReader, which is NOT thread-safe. So if you use it in a multi-threaded step, you need to synchronize it.
You can wrap it in a SynchronizedItemStreamReader. Here is a quick example:
@Bean
public SynchronizedItemStreamReader<DataLifeCycleEvent> itemReader() {
    FlatFileItemReader<DataLifeCycleEvent> itemReader = ... // your item reader
    SynchronizedItemStreamReader<DataLifeCycleEvent> synchronizedItemStreamReader = new SynchronizedItemStreamReader<>();
    synchronizedItemStreamReader.setDelegate(itemReader);
    return synchronizedItemStreamReader;
}
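To give an idea of how this fits together, here is a rough sketch of a multi-threaded step that uses the synchronized reader (the processor, writer, chunk size and pool size are placeholders, and stepBuilderFactory is assumed to be injected; this is not your exact configuration):
@Bean
public Step multiThreadedStep(SynchronizedItemStreamReader<DataLifeCycleEvent> itemReader) {
    ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
    taskExecutor.setCorePoolSize(4);   // tune to your needs
    taskExecutor.setMaxPoolSize(4);
    taskExecutor.afterPropertiesSet();

    return stepBuilderFactory.get("multiThreadedStep")
            .<DataLifeCycleEvent, DataLifeCycleEvent>chunk(100)   // placeholder chunk size
            .reader(itemReader)          // the SynchronizedItemStreamReader bean defined above
            .processor(itemProcessor())  // placeholder for your processor
            .writer(itemWriter())        // placeholder; the writer must also cope with concurrent writes
            .taskExecutor(taskExecutor)  // this is what makes the step multi-threaded
            .build();
}
Since the step is multi-threaded, keeping saveState(false) on the delegate reader, as you already do, is the usual recommendation, because the read state is not reliable for restart across threads.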

This method is giving this exception: java.lang.ClassCastException: com.sun.proxy.$Proxy344 cannot be cast to org.springframework.batch.item.support.SynchronizedItemStreamReader

Related

Threading a variable number of heterogeneous input files, process the input and output to a single file

I am trying to improve the performance of the job listed below. As is, without threading, it runs successfully, but very slowly. I would like to thread Step2, where 95% of the work happens in reading, filtering and transforming the input data read from very large heterogeneous files. The job:
• Step1 gets some job parameters that are passed into Step2.
• Step2 will read in X number of files. Each file is heterogeneous, i.e., it contains several different record formats. The records are filtered, transformed and sent to a single output file.
Does Spring Batch have a built-in way to thread Step2 in this scenario? For example, can I add some type of executor to Step2? I've tried SimpleAsyncTaskExecutor and ThreadPoolTaskExecutor; neither works. Adding SimpleAsyncTaskExecutor throws an exception. (See can we process the multiple files sequentially using spring Batch while multiple threads used to process individual files data..?)
Here is the batch configuration:
@Bean
public Job job() {
    return jobBuilderFactory.get("MyJob")
            .start(step1())
            .next(step2())
            .build();
}
@Bean
public Step step1() {
    return stepBuilderFactory.get("Step1GetJobParams")
            .tasklet(MyParamsTasklet)
            .build();
}

@Bean
public Step step2() {
    return stepBuilderFactory.get("Step2")
            .<InputDO, OutputDO>chunk(1000)
            .reader(myMultiResourceReader())
            .processor(myStep2ItemProcessor)
            .writer(myStep2FileWriter())
            .taskExecutor(???) // line #23
            .build();
}

@Bean
public MultiResourceItemReader<InputDO> myMultiResourceReader() {
    MultiResourceItemReader<InputDO> multiResourceItemReader = new MultiResourceItemReader<InputDO>();
    multiResourceItemReader.setResources(resourceManager.getResources());
    multiResourceItemReader.setDelegate(myStep2FileReader());
    multiResourceItemReader.setSaveState(false);
    return multiResourceItemReader;
}

@Bean
public FlatFileItemReader<InputDO> myStep2FileReader() {
    return new FlatFileItemReaderBuilder<InputDO>()
            .name("MyStep2FileReader")
            .lineMapper(myCompositeLineMapper())
            .build();
}

@Bean
public PatternMatchingCompositeLineMapper<InputDO> myCompositeLineMapper() {
    PatternMatchingCompositeLineMapper<InputDO> lineMapper = new PatternMatchingCompositeLineMapper<InputDO>();

    Map<String, LineTokenizer> tokenizers = new HashMap<String, LineTokenizer>();
    tokenizers.put("A", InputDOTokenizer.getInputDOTokenizer());
    tokenizers.put("*", InputDOFillerTokenizer.getInputDOFillerTokenizer());
    lineMapper.setTokenizers(tokenizers);

    Map<String, FieldSetMapper<InputDO>> mappers = new HashMap<String, FieldSetMapper<InputDO>>();
    mappers.put("A", new InputDOFieldSetMapper());
    mappers.put("*", new InputDOFillerFieldSetMapper());
    lineMapper.setFieldSetMappers(mappers);

    return lineMapper;
}

@Bean
public FlatFileItemWriter<OutputDO> myOutputDOFileWriter() {
    return new FlatFileItemWriterBuilder<OutputDO>()
            .name("MyOutputDOFileWriter")
            .resource(resourceManager.getFileSystemResource("myOutputDOFileName"))
            .lineAggregator(new DelimitedLineAggregator<OutputDO>() {
                {
                    setDelimiter("");
                    setFieldExtractor(outputDOFieldExtractor.getOutputDOFieldExtractor());
                }
            })
            .lineSeparator("\r\n")
            .build();
}
Any/all guidance is much appreciated!
I guess you want to use the multi-threaded step mode to resolve the slow reading. More details are available in the Spring Batch reference documentation under Multi-threaded Step.
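In concrete terms, that means passing a task executor to your step2 and synchronizing the reader, since the FlatFileItemReader that the MultiResourceItemReader delegates to is not thread-safe. A rough sketch based on your beans (the synchronizedReader bean name and the pool sizes are just placeholders, not from your code):
@Bean
public SynchronizedItemStreamReader<InputDO> synchronizedReader() {
    SynchronizedItemStreamReader<InputDO> reader = new SynchronizedItemStreamReader<>();
    reader.setDelegate(myMultiResourceReader()); // your existing MultiResourceItemReader
    return reader;
}

@Bean
public Step step2() {
    ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
    taskExecutor.setCorePoolSize(4);   // tune to your hardware
    taskExecutor.setMaxPoolSize(4);
    taskExecutor.afterPropertiesSet();

    return stepBuilderFactory.get("Step2")
            .<InputDO, OutputDO>chunk(1000)
            .reader(synchronizedReader())    // synchronized wrapper instead of the raw reader
            .processor(myStep2ItemProcessor)
            .writer(myStep2FileWriter())     // check that the writer can handle concurrent writes
            .taskExecutor(taskExecutor)
            .build();
}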
Hope this helps.

How to pass arguments from slave steps to reader in Spring Batch?

I have a Spring Batch process that is reading data from a database. Basically, I have a SQL query that needs to get data by a column (type) value. That column has 50 different values, so there are 50 queries, each executed in a separate slave step. But the query is built inside the Reader, so I need to pass each type to the Reader to build the query and read the data. I am using a Partitioner to split the query with an offset and limit.
Here is the code I have:
private Flow flow(List<Step> steps) {
    SimpleAsyncTaskExecutor taskExecutor = new SimpleAsyncTaskExecutor();
    taskExecutor.setConcurrencyLimit(1);
    return new FlowBuilder<SimpleFlow>("flow")
            .split(taskExecutor)
            .add(steps.stream()
                    .map(step -> new FlowBuilder<Flow>("flow_" + step.getName()).start(step).build())
                    .toArray(Flow[]::new))
            .build();
}

@Bean
public Job job() {
    List<Step> masterSteps = TYPES.stream().map(this::masterStep).collect(Collectors.toList());
    return jobBuilderFactory.get("job")
            .incrementer(new RunIdIncrementer())
            .start(flow(masterSteps))
            .end()
            .build();
}

@Bean
@SneakyThrows
public Step slaveStep(String type) {
    return stepBuilderFactory.get("slaveStep")
            .<User, User>chunk(100)
            .reader(reader(type, 0, 0))
            .writer(writer())
            .build();
}

@Bean
@SneakyThrows
public Step masterStep(String type) {
    return stepBuilderFactory.get("masterStep")
            .partitioner(slaveStep(type).getName(), partitioner(0))
            .step(slaveStep(type))
            .gridSize(5)
            .taskExecutor(executor)
            .build();
}

@Bean
@StepScope
@SneakyThrows
public JdbcCursorItemReader<User> reader(String type,
        @Value("#{stepExecutionContext['offset']}") Integer offset,
        @Value("#{stepExecutionContext['limit']}") Integer limit) {
    String query = MessageFormat.format(SELECT_QUERY, type, offset, limit); // Ex: SELECT * FROM users WHERE type = 'type' OFFSET 500 LIMIT 1000;
    JdbcCursorItemReader<User> itemReader = new JdbcCursorItemReader<>();
    itemReader.setSql(query);
    itemReader.setDataSource(dataSource);
    itemReader.setRowMapper(new UserMapper());
    itemReader.afterPropertiesSet();
    return itemReader;
}

@Bean
@StepScope
public ItemWriter<User> writer() {
    return new Writer();
}

@Bean
@StepScope
public Partitioner partitioner(@Value("#{jobParameters['limit']}") int limit) {
    return new Partitioner(limit);
}
The issue I am having is that the type value is not being passed to the reader() method. And when I add the @Bean annotation it says Could not autowire. No beans of 'String' type found. If I don't put @Bean, offset and limit are always 0 because @Value is not populated. Right now, when I execute the batch, nothing happens inside the reader because type is null. When I hardcode the value it works. So how can I fix this? Thanks in advance.
If you are iterating over every TYPE and executing masterStep, why don't you remove that TYPE logic and instead run SELECT * FROM table OFFSET ? LIMIT ? and handle offset and limit inside the Partitioner? Your 5 threads will then handle this. If your final goal is to process every record in that table, then you can simply do this without worrying about TYPE and without executing them in separate Steps.
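A rough sketch of such a partitioner (just an illustration: the class name is made up and the total row count is assumed to be queried beforehand, e.g. with a COUNT(*)). It fills the same offset and limit keys your reader already binds via #{stepExecutionContext[...]}:
public class OffsetLimitPartitioner implements Partitioner {

    private final int totalRows; // e.g. SELECT COUNT(*) FROM users, queried up front

    public OffsetLimitPartitioner(int totalRows) {
        this.totalRows = totalRows;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        int range = (int) Math.ceil((double) totalRows / gridSize);
        for (int i = 0; i < gridSize; i++) {
            ExecutionContext context = new ExecutionContext();
            context.putInt("offset", i * range); // picked up via #{stepExecutionContext['offset']}
            context.putInt("limit", range);      // picked up via #{stepExecutionContext['limit']}
            partitions.put("partition" + i, context);
        }
        return partitions;
    }
}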

To use Tasklet or Chunk in this scenario

I have a job/task to read the sub-folders/directories of a given folder/path. The path is dynamic; we get it from a Controller. Currently I have used Tasklets: there are 3 tasklets, one to read the sub-directories, another to process them to prepare objects to save to the DB, and a last one to write the processed data objects to the database.
The folders can have any number of sub-folders. Currently, I have used this code:
Path start = Paths.get("x:\\data\\");
Stream<Path> stream = Files.walk(start, 1);
List<String> collect = stream
.map(String::valueOf)
.sorted()
.collect(Collectors.toList());
to read all the sub-folders at once.
I followed this Tasklet implementation example for the purpose: https://www.baeldung.com/spring-batch-tasklet-chunk. Is this the right approach? I also need to run the job asynchronously with multi-threading.
As there can be a huge number of sub-folders, there can be a huge number of rows/entries to process and write to the database.
Please suggest an appropriate approach.
I am learning Spring Batch and have done a few examples on file read/process/write, using the chunk approach for those.
But my job is to read the sub-directories of a folder/path, so I cannot decide which approach to follow.
I have a similar scenario: I need to read all the files from a folder, process them and write them to the DB. (Doc)
@Configuration
@EnableBatchProcessing
public class BatchConfig {

    @Bean
    public Job job(JobBuilderFactory jobBuilderFactory,
                   Step masterStep) {
        return jobBuilderFactory.get("MainJob")
                .incrementer(new RunIdIncrementer())
                .flow(masterStep)
                .end()
                .build();
    }

    @Bean
    public Step mainStep(StepBuilderFactory stepBuilderFactory,
                         JdbcBatchItemWriter<Transaction> writer,
                         ItemReader<String> reader,
                         TransactionItemProcessor processor) {
        return stepBuilderFactory.get("Main")
                .<String, Transaction>chunk(2)
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .taskExecutor(jobTaskExecutor())   // <-- this is the multi-threading part
                .listener(new ItemReaderListener())
                .build();
    }

    @Bean
    public TaskExecutor jobTaskExecutor() {
        ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
        taskExecutor.setCorePoolSize(2);
        taskExecutor.setMaxPoolSize(10);
        taskExecutor.afterPropertiesSet();
        return taskExecutor;
    }

    @Bean
    @StepScope
    public ItemReader<String> reader(@Value("#{stepExecution}") StepExecution stepExecution) throws IOException {
        Path start = Paths.get("D:\\test");
        List<String> inputFile = Files.walk(start, 1)
                .map(String::valueOf)
                .sorted()
                .collect(Collectors.toList());
        return new IteratorItemReader<>(inputFile);
    }

    @Bean
    @StepScope
    public TransactionItemProcessor processor(@Value("#{stepExecution}") StepExecution stepExecution) {
        return new TransactionItemProcessor();
    }

    @Bean
    @StepScope
    public JdbcBatchItemWriter<Transaction> writer(DataSource dataSource) {
        return new JdbcBatchItemWriterBuilder<Transaction>()
                .itemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider<>())
                .sql("INSERT INTO transaction (id, date, type) VALUES (:id, :date, :type)")
                .dataSource(dataSource)
                .build();
    }
}
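Since you also asked about running the job itself asynchronously: one common option (a sketch, not part of the configuration above; the JobRepository is assumed to be provided by @EnableBatchProcessing) is to give the JobLauncher an asynchronous task executor so that launching the job returns immediately:
@Bean
public JobLauncher asyncJobLauncher(JobRepository jobRepository) throws Exception {
    SimpleJobLauncher jobLauncher = new SimpleJobLauncher();
    jobLauncher.setJobRepository(jobRepository);
    jobLauncher.setTaskExecutor(new SimpleAsyncTaskExecutor()); // launch returns without waiting for the job to finish
    jobLauncher.afterPropertiesSet();
    return jobLauncher;
}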

Partitioned Spring Batch Step repeats the same successful slave StepExecutions

Using Spring Batch 3.0.4.RELEASE.
I configure a job to use a partition step. The slave step uses chunk size 1. There are six threads in the task executor. I run this test with various grid sizes from six to hundreds. My grid size is the number of slave StepExecutions I expect, i.e. the number of ExecutionContexts created by my partitioner.
The result is always this:
The six threads pick up six different step executions and execute them successfully. Then the same six step executions run again and again in the same thread!
I notice that there is a loop in RepeatTemplate.executeInternal(...) that never ends. It keeps executing the same StepExecution just incrementing the version.
Here's the Java configuration code:
@Bean
@StepScope
public RapRequestItemReader rapReader(
        @Value("#{stepExecutionContext['" + RapJobConfig.LIST_OF_IDS_STEP_EXECUTION_CONTEXT_VAR + "']}") String listOfIds,
        final @Value("#{stepExecutionContext['" + RapJobConfig.TIME_STEP_EXECUTION_CONTEXT_VAR + "']}") String timeString) {
    final List<Asset> farms = Arrays.asList(listOfIds.split(",")).stream()
            .map(intString -> assetDao.getById(Integer.valueOf(intString)))
            .collect(Collectors.toList());
    return new RapRequestItemReader(timeString, farms);
}

@Bean
public ItemProcessor<RapRequest, PullSuccess> rapProcessor() {
    return rapRequest -> {
        return rapPull.pull(rapRequest.timestamp, rapRequest.farms);
    };
}

@Bean
public TaskletStep rapStep1(StepBuilderFactory stepBuilderFactory, RapRequestItemReader rapReader) {
    return stepBuilderFactory.get(RAP_STEP_NAME)
            .<RapRequest, PullSuccess>chunk(RAP_STEP_CHUNK_SIZE)
            .reader(rapReader)
            .processor(rapProcessor())
            .writer(updateCoverageWriter)
            .build();
}

private RapFilePartitioner createRapFilePartitioner(RapParameter rapParameter) {
    RapFilePartitioner partitioner = new RapFilePartitioner(rapParameter, rapPull.getIncrementHours());
    return partitioner;
}

@Bean
public ThreadPoolTaskExecutor pullExecutor() {
    ThreadPoolTaskExecutor pullExecutor = new ThreadPoolTaskExecutor();
    pullExecutor.setCorePoolSize(weatherConfig.getNumberOfThreadsPerModelType());
    pullExecutor.setMaxPoolSize(weatherConfig.getNumberOfThreadsPerModelType());
    pullExecutor.setAllowCoreThreadTimeOut(true);
    return pullExecutor;
}

@Bean
@JobScope
public Step rapPartitionByTimestampStep(StepBuilderFactory stepBuilderFactory,
        @Value("#{jobParameters['config']}") String config,
        TaskletStep rapStep1) {
    RapParameter rapParameter = GsonHelper.fromJson(config, RapParameter.class);
    int gridSize = calculateGridSize(rapParameter);
    return stepBuilderFactory.get("rapPartitionByTimestampStep")
            .partitioner(rapStep1)
            .partitioner(RAP_STEP_NAME, createRapFilePartitioner(rapParameter))
            .taskExecutor(pullExecutor())
            .gridSize(gridSize)
            .build();
}

@Bean
public Job rapJob(JobBuilderFactory jobBuilderFactory, Step rapPartitionByTimestampStep) {
    return jobBuilderFactory.get(JOB_NAME)
            .start(rapPartitionByTimestampStep)
            .build();
}
Though it's hard to tell this from the question, the problem was in the reader. The ItemReader was never returning null.
In the design, a StepExecution was supposed to process only one item. However, after processing that item, the ItemReader was returning that same item again instead of returning null.
I fixed it by having the ItemReader return null the second time read is called.
A better design might be to use a TaskletStep instead of a ChunkStep.
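For illustration, a minimal sketch of such a reader (the class and field names are made up, not taken from the original code): it hands out its single item once and then returns null to signal that the input is exhausted.
public class SingleItemReader<T> implements ItemReader<T> {

    private final T item;
    private boolean read = false;

    public SingleItemReader(T item) {
        this.item = item;
    }

    @Override
    public T read() {
        if (read) {
            return null; // null tells Spring Batch the input is exhausted
        }
        read = true;
        return item;
    }
}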

Multiple ItemWriters in Spring Batch

I am currently writing a Spring Batch job where I am reading a chunk of data, processing it and then passing this data to 2 writers. One writer will simply update the database, whereas the second writer will write to a CSV file.
I am planning to write my own custom writer, inject the two ItemWriters into the customItemWriter, and call the write methods of both item writers in the write method of the customItemWriter. Is this approach correct? Are there any ItemWriter implementations available which meet my requirements?
Thanks in advance
You can use Spring's CompositeItemWriter and delegate to it all your writers.
Here is a configuration example.
You don't necessarily have to use XML like the example. If the rest of your code uses annotations, you could simply do the following:
@Bean
public ItemWriter<T> writerOne() {
    ItemWriter<T> writer = items -> {
        // your logic here
    };
    return writer;
}

@Bean
public ItemWriter<T> writerTwo() {
    ItemWriter<T> writer = items -> {
        // your logic here
    };
    return writer;
}

@Bean
public CompositeItemWriter<T> compositeItemWriter() {
    CompositeItemWriter<T> writer = new CompositeItemWriter<>();
    writer.setDelegates(Arrays.asList(writerOne(), writerTwo()));
    return writer;
}
You were right. Spring Batch is heavily based on delegation, so using a CompositeItemWriter is the right choice for your needs.
Java config way (Spring Batch 4):
@Bean
public Step step1() {
    return this.stepBuilderFactory.get("step1")
            .<String, String>chunk(2)
            .reader(itemReader())
            .writer(compositeItemWriter())
            .stream(fileItemWriter1())
            .stream(fileItemWriter2())
            .build();
}

/**
 * In Spring Batch 4, the CompositeItemWriter implements ItemStream so this isn't
 * necessary, but used for an example.
 */
@Bean
public CompositeItemWriter compositeItemWriter() {
    List<ItemWriter> writers = new ArrayList<>(2);
    writers.add(fileItemWriter1());
    writers.add(fileItemWriter2());

    CompositeItemWriter itemWriter = new CompositeItemWriter();
    itemWriter.setDelegates(writers);
    return itemWriter;
}
Depending on your need, another option is to extend the Writer class and add functionality there. For example, I have a project where I am extending HibernateItemWriter and then overriding write(List items). I then send the objects I am writing along with my sessionFactory to the doWrite method of the Writer: doWrite(sessionFactory, filteredRecords).
So in the example above, I could write to the csv file in my extended class and then the HibernateItemWriter would write to the database. Obviously this might not be ideal for this example, but for certain scenarios it is a nice option.
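Roughly, such an extension could look like the sketch below (the class and field names are made up; it assumes the Spring Batch 4 HibernateItemWriter API, where write(List) delegates to doWrite(SessionFactory, List), and the CSV part is only indicated by a comment):
public class CsvAndDbItemWriter<T> extends HibernateItemWriter<T> {

    private final SessionFactory sessionFactory;

    public CsvAndDbItemWriter(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
        setSessionFactory(sessionFactory);
    }

    @Override
    public void write(List<? extends T> items) {
        // 1. write the items to the CSV file here (your own logic)
        // 2. then persist them via HibernateItemWriter, mirroring the
        //    doWrite(sessionFactory, filteredRecords) call described above
        doWrite(sessionFactory, items);
    }
}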
Here's a possible solution. Two writers inside a Composite Writer.
@Bean
public JdbcBatchItemWriter<XPTO> writer(DataSource dataSource) {
    return new JdbcBatchItemWriterBuilder<XPTO>()
            .itemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider<>())
            .sql("UPDATE xxxx")
            .dataSource(dataSource)
            .build();
}

@Bean
public JdbcBatchItemWriter<XPTO> writer2(DataSource dataSource) {
    return new JdbcBatchItemWriterBuilder<XPTO>()
            .itemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider<>())
            .sql("UPDATE yyyyy")
            .dataSource(dataSource)
            .build();
}

@Bean
public CompositeItemWriter<XPTO> compositeItemWriter(DataSource dataSource) {
    CompositeItemWriter<XPTO> compositeItemWriter = new CompositeItemWriter<>();
    compositeItemWriter.setDelegates(Arrays.asList(writer(dataSource), writer2(dataSource)));
    return compositeItemWriter;
}

@Bean
protected Step step1(DataSource dataSource) {
    return this.stepBuilderFactory.get("step1")
            .<XPTO, XPTO>chunk(1)
            .reader(reader())
            .processor(processor())
            .writer(compositeItemWriter(dataSource))
            .build();
}
