How to pass arguments from slave steps to reader in Spring Batch? - java

I have a Spring Batch process that reads data from a database. A SQL query needs to fetch data by a column (type) value, and that column has 50 different values, so there are 50 queries and each is executed in a separate slave step. But the query is built inside the Reader, so I need to pass each type to the Reader to build the query and read the data. I am using a Partitioner to split the query with Offset and Limit.
Here is the code I have,
private Flow flow(List<Step> steps) {
    SimpleAsyncTaskExecutor taskExecutor = new SimpleAsyncTaskExecutor();
    taskExecutor.setConcurrencyLimit(1);
    return new FlowBuilder<SimpleFlow>("flow")
            .split(taskExecutor).add(steps.stream().map(step -> new FlowBuilder<Flow>("flow_" + step.getName())
                    .start(step).build()).toArray(Flow[]::new)).build();
}

@Bean
public Job job() {
    List<Step> masterSteps = TYPES.stream().map(this::masterStep).collect(Collectors.toList());
    return jobBuilderFactory.get("job")
            .incrementer(new RunIdIncrementer())
            .start(flow(masterSteps))
            .end()
            .build();
}

@Bean
@SneakyThrows
public Step slaveStep(String type) {
    return stepBuilderFactory.get("slaveStep")
            .<User, User>chunk(100)
            .reader(reader(type, 0, 0))
            .writer(writer())
            .build();
}

@Bean
@SneakyThrows
public Step masterStep(String type) {
    return stepBuilderFactory.get("masterStep")
            .partitioner(slaveStep(type).getName(), partitioner(0))
            .step(slaveStep(type))
            .gridSize(5)
            .taskExecutor(executor)
            .build();
}

@Bean
@StepScope
@SneakyThrows
public JdbcCursorItemReader<User> reader(String type,
        @Value("#{stepExecutionContext['offset']}") Integer offset,
        @Value("#{stepExecutionContext['limit']}") Integer limit) {
    String query = MessageFormat.format(SELECT_QUERY, type, offset, limit); // Ex: SELECT * FROM users WHERE type = 'type' OFFSET 500 LIMIT 1000;
    JdbcCursorItemReader<User> itemReader = new JdbcCursorItemReader<>();
    itemReader.setSql(query);
    itemReader.setDataSource(dataSource);
    itemReader.setRowMapper(new UserMapper());
    itemReader.afterPropertiesSet();
    return itemReader;
}

@Bean
@StepScope
public ItemWriter<User> writer() {
    return new Writer();
}

@Bean
@StepScope
public Partitioner partitioner(@Value("#{jobParameters['limit']}") int limit) {
    return new Partitioner(limit);
}
The issue is that the type value is not passed into the reader() method. When I add the @Bean annotation it says Could not autowire. No beans of 'String' type found. If I don't put @Bean, offset and limit are always 0 because @Value is not populated. Right now, when I execute the batch, nothing happens inside the reader because type is null. When I hardcode the value it works. So how can I fix this? Thanks in advance.

If you are iterating over every TYPE and executing a masterStep for each, why not remove the TYPE logic entirely? You can SELECT * FROM table OFFSET ? LIMIT ? and handle offset and limit inside the Partitioner; your 5 threads will then work through the partitions. If your final goal is to process every record in that table, you can simply do this without worrying about TYPE and without executing each type in a separate Step.
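A minimal sketch of that idea, assuming the same users table, dataSource and UserMapper as in the question; RangePartitioner and totalRows are illustrative names, not part of the original code:

// Hypothetical partitioner: each partition gets its own offset/limit pair in its ExecutionContext.
public class RangePartitioner implements Partitioner {

    private final int totalRows; // e.g. the result of SELECT COUNT(*) FROM users

    public RangePartitioner(int totalRows) {
        this.totalRows = totalRows;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        int range = (int) Math.ceil((double) totalRows / gridSize);
        for (int i = 0; i < gridSize; i++) {
            ExecutionContext context = new ExecutionContext();
            context.putInt("offset", i * range);
            context.putInt("limit", range);
            partitions.put("partition" + i, context);
        }
        return partitions;
    }
}

// The step-scoped reader then only needs values from the step execution context,
// so nothing has to be passed into the bean method when the step is built.
@Bean
@StepScope
public JdbcCursorItemReader<User> reader(
        @Value("#{stepExecutionContext['offset']}") Integer offset,
        @Value("#{stepExecutionContext['limit']}") Integer limit) {
    JdbcCursorItemReader<User> itemReader = new JdbcCursorItemReader<>();
    itemReader.setSql("SELECT * FROM users OFFSET " + offset + " LIMIT " + limit);
    itemReader.setDataSource(dataSource);
    itemReader.setRowMapper(new UserMapper());
    return itemReader;
}

If the type column is still needed, the same mechanism works for it: put a type value into each partition's ExecutionContext and bind it in the reader with @Value("#{stepExecutionContext['type']}") instead of passing it as a method argument.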

Related

Threading a variable number of heterogeneous input files, process the input and output to a single file

I am trying to improve the performance of the job listed below. As is, without threading, it runs successfully, but it runs very slowly. I would like to thread step 2, where 95% of the work happens in reading, filtering and transforming the input data read from very large heterogeneous files. The job:
• Step1 gets some job parameters that are passed into Step2.
• Step2 will read in X number of files. Each file is heterogenous, i.e., contains several different record formats. The records are filtered, transformed and sent to a single output file.
Does Spring Batch have a built-in way to thread Step2 in this scenario? For example, can I add some type of executor to step2? I've tried SimpleAsyncTaskExecutor and ThreadPoolTaskExecutor. Neither works. Adding SimpleAsyncTaskExecutor throws an exception. (See can we process the multiple files sequentially using spring Batch while multiple threads used to process individual files data..?)
Here is the batch configuration:
public Job job() {
    return jobBuilderFactory.get("MyJob")
            .start(step1())
            .next(step2())
            .build();
}

@Bean
public Step step1() {
    return stepBuilderFactory.get("Step1GetJobParams")
            .tasklet(MyParamsTasklet)
            .build();
}

@Bean
public Step step2() {
    return stepBuilderFactory.get("Step2")
            .<InputDO, OutputDO>chunk(1000)
            .reader(myMultiResourceReader())
            .processor(myStep2ItemProcessor)
            .writer(myStep2FileWriter())
            .taskExecutor(???) // line #23
            .build();
}

@Bean
public MultiResourceItemReader<InputDO> myMultiResourceReader() {
    MultiResourceItemReader<InputDO> multiResourceItemReader = new MultiResourceItemReader<InputDO>();
    multiResourceItemReader.setResources(resourceManager.getResources());
    multiResourceItemReader.setDelegate(myStep2FileReader());
    multiResourceItemReader.setSaveState(false);
    return multiResourceItemReader;
}

@Bean
public FlatFileItemReader<InputDO> myStep2FileReader() {
    return new FlatFileItemReaderBuilder<InputDO>()
            .name("MyStep2FileReader")
            .lineMapper(myCompositeLineMapper())
            .build();
}

@Bean
public PatternMatchingCompositeLineMapper<InputDO> myCompositeLineMapper() {
    PatternMatchingCompositeLineMapper<InputDO> lineMapper = new PatternMatchingCompositeLineMapper<InputDO>();
    Map<String, LineTokenizer> tokenizers = new HashMap<String, LineTokenizer>();
    tokenizers.put("A", InputDOTokenizer.getInputDOTokenizer());
    tokenizers.put("*", InputDOFillerTokenizer.getInputDOFillerTokenizer());
    lineMapper.setTokenizers(tokenizers);
    Map<String, FieldSetMapper<InputDO>> mappers = new HashMap<String, FieldSetMapper<InputDO>>();
    mappers.put("A", new InputDOFieldSetMapper());
    mappers.put("*", new InputDOFillerFieldSetMapper());
    lineMapper.setFieldSetMappers(mappers);
    return lineMapper;
}

@Bean
public FlatFileItemWriter<OutputDO> myOutputDOFileWriter() {
    return new FlatFileItemWriterBuilder<OutputDO>()
            .name("MyOutputDOFileWriter")
            .resource(resourceManager.getFileSystemResource("myOutputDOFileName"))
            .lineAggregator(new DelimitedLineAggregator<OutputDO>() {
                {
                    setDelimiter("");
                    setFieldExtractor(outputDOFieldExtractor.getOutputDOFieldExtractor());
                }
            })
            .lineSeparator("\r\n")
            .build();
}
Any/all guidance is much appreciated!
I guess you want to use the Multi-threaded Step mode to address the slow reading. More details are available in the Spring Batch reference documentation under Multi-threaded Step.
Hope this helps.
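A sketch of what that mode could look like for Step2, reusing the beans from the question. Note that MultiResourceItemReader (and the FlatFileItemReader it delegates to) is not thread-safe, so this sketch additionally wraps it in a SynchronizedItemStreamReader; the pool size of 4 is arbitrary:

@Bean
public TaskExecutor step2TaskExecutor() {
    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    executor.setCorePoolSize(4);
    executor.setMaxPoolSize(4);
    executor.setThreadNamePrefix("step2-");
    executor.afterPropertiesSet();
    return executor;
}

@Bean
public SynchronizedItemStreamReader<InputDO> synchronizedReader() {
    // Wrap the non-thread-safe MultiResourceItemReader so concurrent chunks
    // do not corrupt its internal read state.
    SynchronizedItemStreamReader<InputDO> reader = new SynchronizedItemStreamReader<>();
    reader.setDelegate(myMultiResourceReader());
    return reader;
}

@Bean
public Step step2() {
    return stepBuilderFactory.get("Step2")
            .<InputDO, OutputDO>chunk(1000)
            .reader(synchronizedReader())
            .processor(myStep2ItemProcessor)
            .writer(myStep2FileWriter())
            .taskExecutor(step2TaskExecutor())
            .build();
}

A similar caveat applies on the writer side: it is worth verifying that concurrent writes from multiple chunks to the single output file behave as expected in a multi-threaded step.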

Set chunksize dynamically after fetching from db

I need to set the chunk size dynamically in a Spring Batch job's step; the chunk size is stored in the database, i.e. it needs to be fetched from the database and set into the bean.
My Query is something like:
select CHUNK_SIZE from SOME_TABLE_NAME where ID='some_id_param_value'
Here the value for ID would come from the job parameters, which are set via a request param passed with the request into the Rest Controller (while triggering the batch job).
I want to fetch this CHUNK_SIZE from the database and set it dynamically into the job's step.
Our requirement is that the chunksize varies for the step based on the ID value, the details of which are stored in a db table. For example:
ID | CHUNK_SIZE
01 | 1000
02 | 2500
I know that the beans in a job are set at the configuration time, and the job parameters are passed at the runtime while triggering the job.
EDIT:
The example provided by MahmoudBenHassine uses @JobScope and accesses the jobParameters in the step bean using @Value("#{jobParameters['id']}"). I tried implementing a similar approach using the jobExecutionContext as follows:
• Fetched the chunkSize from the db table in the StepExecutionListener's beforeStep method and set it in the ExecutionContext.
• Annotated the step bean with @JobScope and used @Value("#{jobExecutionContext['chunk']}") to access it in the step bean.
But I face the following error:
Error creating bean with name 'scopedTarget.step' defined in class path resource [com/sample/config/SampleBatchConfig.class]: Bean instantiation via factory method failed; nested exception is org.springframework.beans.BeanInstantiationException: Failed to instantiate [org.springframework.batch.core.Step]: Factory method 'step' threw exception; nested exception is java.lang.NullPointerException
It is not able to access the 'chunk' key-value from the jobExecutionContext, thus throwing the NullPointerException.
Does it need to be promoted somehow so that it can be accessed in the step bean? If yes, a quick sample or a direction would be really appreciated.
My Controller class:
@RestController
public class SampleController {

    @Autowired
    JobLauncher sampleJobLauncher;

    @Autowired
    Job sampleJob;

    @GetMapping("/launch")
    public BatchStatus launch(@RequestParam(name = "id", required = true) String id) throws Exception {
        Map<String, JobParameter> map = new HashMap<>();
        map.put("id", new JobParameter(id));
        map.put("timestamp", new JobParameter(System.currentTimeMillis()));
        JobParameters params = new JobParameters(map);
        JobExecution j = sampleJobLauncher.run(sampleJob, params);
        return j.getStatus();
    }
}
My batch config class(containing job and step bean):
@Configuration
public class SampleBatchConfig {

    @Autowired
    private JobBuilderFactory myJobBuilderFactory;

    @Autowired
    private StepBuilderFactory myStepBuilderFactory;

    @Autowired
    private MyRepoClass myRepo; // this class contains the jdbc method to fetch chunksize from the db table

    @Autowired
    MyReader myReader;

    @Autowired
    MyWriter myWriter;

    @Bean
    @JobScope
    public Step sampleStep(@Value("#{jobExecutionContext['chunk']}") Integer chunkSize) {
        return myStepBuilderFactory.get("sampleStep")
                .<MyClass, MyClass>chunk(chunkSize) // TODO ~instead of hardcoding the chunkSize or getting it from the properties file using @Value, the requirement is to fetch it from the db table using the above mentioned query with id job parameter and set it here
                .reader(myReader.sampleReader())
                .writer(myWriter.sampleWriter())
                .listener(new StepExecutionListener() {
                    @Override
                    public void beforeStep(StepExecution stepExecution) {
                        int chunk = myRepo.findChunkSize(stepExecution.getJobExecution().getExecutionContext().get("id")); // this method call fetches chunksize from the db table using the id job parameter
                        stepExecution.getJobExecution().getExecutionContext().put("chunk", chunk);
                    }

                    @Override
                    public ExitStatus afterStep(StepExecution stepExecution) {
                        return null;
                    }
                })
                .build();
    }

    @Bean
    public Job job() {
        return myJobBuilderFactory.get("sampleJob")
                .incrementer(new RunIdIncrementer())
                .start(sampleStep(null))
                .build();
    }
}
NOTE:
The job may have multiple steps with different chunkSizes, and in that case chunkSize is to be fetched separately for each step.
EDIT 2:
Changing my step definition as follows works, but there is a problem.
Here the reader reads a list of 17 items, in chunks of size 4.
@Bean
@JobScope
public Step sampleStep(@Value("#{jobParameters['id']}") Integer id) {
    int chunkSize = myRepo.findChunkSize(id); // this method call fetches chunksize from the db table using the id job parameter
    return myStepBuilderFactory.get("sampleStep")
            .<MyClass, MyClass>chunk(chunkSize)
            .reader(myReader.sampleReader())
            .writer(myWriter.sampleWriter())
            .listener(new ChunkListenerSupport() {
                @Override
                public void afterChunk(ChunkContext context) {
                    System.out.println("MyJob.afterChunk");
                }

                @Override
                public void beforeChunk(ChunkContext context) {
                    System.out.println("MyJob.beforeChunk");
                }
            })
            .build();
}
The first time I trigger the job from the url, it works fine and prints the following: (The chunk Size is set to 4 in the db table)
2021-05-03 15:06:44.859 INFO 11924 --- [nio-8081-exec-1] o.s.batch.core.job.SimpleStepHandler : Executing step: [sampleStep]
MyJob.beforeChunk
item = 1
item = 2
item = 3
item = 4
MyJob.afterChunk
MyJob.beforeChunk
item = 5
item = 6
item = 7
item = 8
MyJob.afterChunk
MyJob.beforeChunk
item = 9
item = 10
item = 11
item = 12
MyJob.afterChunk
MyJob.beforeChunk
item = 13
item = 14
item = 15
item = 16
MyJob.afterChunk
MyJob.beforeChunk
item = 17
MyJob.afterChunk
But if I trigger the job again, without restarting the server/spring container, the following is printed:
2021-05-03 15:11:02.427 INFO 11924 --- [nio-8081-exec-4] o.s.batch.core.job.SimpleStepHandler : Executing step: [sampleStep]
MyJob.beforeChunk
MyJob.afterChunk
In short, it works fine exactly once after the server is restarted, but it doesn't work for subsequent job executions without restarting the server.
Since you pass the ID as a job parameter and you want to get the chunk size dynamically from the database based on that ID while configuring the step, you can use a job-scoped step as follows:
@Bean
@JobScope
public Step sampleStep(@Value("#{jobParameters['id']}") Integer id) {
    int chunkSize = myRepo.findChunkSize(id); // this method call fetches chunksize from the db table using the id job parameter
    return myStepBuilderFactory.get("sampleStep")
            .<MyClass, MyClass>chunk(chunkSize)
            .reader(myReader.sampleReader())
            .writer(myWriter.sampleWriter())
            .build();
}
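Regarding the behaviour in EDIT 2 (the second run reads nothing): one plausible cause, assuming myReader.sampleReader() returns a singleton, stateful reader such as a ListItemReader or IteratorItemReader, is that the same already-exhausted reader instance is reused by the second job execution. Making the reader step-scoped re-creates it for every execution. A minimal sketch, with myRepo.findAllItems() as a hypothetical way of loading the 17 items:

@Bean
@StepScope
public ListItemReader<MyClass> sampleReader() {
    // Re-created for each step execution, so every run starts from a fresh, unread list
    // instead of the list the first execution already consumed.
    List<MyClass> items = myRepo.findAllItems(); // hypothetical loader for the 17 items
    return new ListItemReader<>(items);
}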

To use Tasklet or Chunk in this scenario

I have a job/task to read the sub-folders/directories of a given folder/path. The path is dynamic; we get it from a Controller. Currently I have used Tasklets: there are 3 tasklets, one to read the sub-directories, another to process them and prepare objects to save to the DB, and a last one to write the processed data objects to a database.
The folders can have any number of sub-folders. Currently, I have used this code:
Path start = Paths.get("x:\\data\\");
Stream<Path> stream = Files.walk(start, 1);
List<String> collect = stream
.map(String::valueOf)
.sorted()
.collect(Collectors.toList());
To read all the sub folders at once.
I followed this https://www.baeldung.com/spring-batch-tasklet-chunk example of a Tasklet implementation for the purpose. Is this the right approach? I also need to run the Job asynchronously with multi-threading.
As there can be a huge number of sub-folders, there can be a huge number of rows or a huge list of data to process and write to the database.
Please suggest an appropriate approach.
I am learning Spring Batch and have done a few examples on file read/process/write too, using the Chunk approach.
But my job is to read the sub-directories of a folder/path, so I cannot decide which approach to follow.
I have a similar scenario: I need to read all the files from a folder, process them and write to the db (Doc):
@Configuration
@EnableBatchProcessing
public class BatchConfig {

    @Bean
    public Job job(JobBuilderFactory jobBuilderFactory,
                   Step masterStep) {
        return jobBuilderFactory.get("MainJob")
                .incrementer(new RunIdIncrementer())
                .flow(masterStep)
                .end()
                .build();
    }

    @Bean
    public Step mainStep(StepBuilderFactory stepBuilderFactory,
                         JdbcBatchItemWriter<Transaction> writer,
                         ItemReader<String> reader,
                         TransactionItemProcessor processor) {
        return stepBuilderFactory.get("Main")
                .<String, Transaction>chunk(2)
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .taskExecutor(jobTaskExecutor())
                .listener(new ItemReaderListener())
                .build();
    }

    @Bean
    public TaskExecutor jobTaskExecutor() {
        ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
        taskExecutor.setCorePoolSize(2);
        taskExecutor.setMaxPoolSize(10);
        taskExecutor.afterPropertiesSet();
        return taskExecutor;
    }

    @Bean
    @StepScope
    public ItemReader<String> reader(@Value("#{stepExecution}") StepExecution stepExecution) throws IOException {
        Path start = Paths.get("D:\\test");
        List<String> inputFile = Files.walk(start, 1)
                .map(String::valueOf)
                .sorted()
                .collect(Collectors.toList());
        return new IteratorItemReader<>(inputFile);
    }

    @Bean
    @StepScope
    public TransactionItemProcessor processor(@Value("#{stepExecution}") StepExecution stepExecution) {
        return new TransactionItemProcessor();
    }

    @Bean
    @StepScope
    public JdbcBatchItemWriter<Transaction> writer(DataSource dataSource) {
        return new JdbcBatchItemWriterBuilder<Transaction>()
                .itemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider<>())
                .sql("INSERT INTO transaction (id, date, type) VALUES (:id, :date, :type)")
                .dataSource(dataSource)
                .build();
    }
}
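If the folder path really needs to come from the Controller rather than being hardcoded, one option (a sketch; inputPath is a hypothetical job parameter name) is to pass it as a job parameter and bind it in the step-scoped reader:

@Bean
@StepScope
public ItemReader<String> reader(@Value("#{jobParameters['inputPath']}") String inputPath) throws IOException {
    // Walk only the first level below the given path and hand the entries to an in-memory reader.
    try (Stream<Path> paths = Files.walk(Paths.get(inputPath), 1)) {
        List<String> subFolders = paths
                .map(String::valueOf)
                .sorted()
                .collect(Collectors.toList());
        return new IteratorItemReader<>(subFolders);
    }
}

Keep in mind that IteratorItemReader is not thread-safe, so with the taskExecutor above the reader may need to be synchronized (see the next question).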

Can I use FlatFileItemReader with a TaskExecutor?

Can I use FlatFileItemReader with a TaskExecutor in Spring Batch?
I have implemented FlatFileItemReader with a ThreadPoolTaskExecutor. When I print the records in the ItemProcessor, I do not get consistent results, i.e. not all the records are printed and sometimes one of the records is printed more than once. This leads me to conclude that FlatFileItemReader is not thread-safe, and the Spring docs say the same, but I see some blogs claiming it is possible to use FlatFileItemReader with a TaskExecutor.
So my question is: is it possible to use FlatFileItemReader with a TaskExecutor in any way?
@Bean
@StepScope
public FlatFileItemReader<DataLifeCycleEvent> csvFileReader(
        @Value("#{stepExecution}") StepExecution stepExecution) {
    Resource inputResource;
    FlatFileItemReader<DataLifeCycleEvent> itemReader = new FlatFileItemReader<>();
    itemReader.setLineMapper(new OnboardingLineMapper(stepExecution));
    itemReader.setLinesToSkip(1);
    itemReader.setSaveState(false);
    itemReader.setSkippedLinesCallback(new OnboardingHeaderMapper(stepExecution));
    String inputResourceString = stepExecution.getJobParameters().getString("inputResource");
    inputResource = new FileSystemResource(inputFileLocation + ApplicationConstant.SLASH + inputResourceString);
    itemReader.setResource(inputResource);
    stepExecution.getJobExecution().getExecutionContext().putInt(ApplicationConstant.ERROR_COUNT, 0);
    return itemReader;
}
FlatFileItemReader extends AbstractItemCountingItemStreamItemReader which is NOT thread-safe. So if you use it in a multi-threaded step, you need to synchronize it.
You can wrap it in a SynchronizedItemStreamReader. Here is a quick example:
@Bean
public SynchronizedItemStreamReader<DataLifeCycleEvent> itemReader() {
    FlatFileItemReader<DataLifeCycleEvent> itemReader = ... // your item reader
    SynchronizedItemStreamReader<DataLifeCycleEvent> synchronizedItemStreamReader = new SynchronizedItemStreamReader<>();
    synchronizedItemStreamReader.setDelegate(itemReader);
    return synchronizedItemStreamReader;
}
This method is giving this exception: java.lang.ClassCastException: com.sun.proxy.$Proxy344 cannot be cast to org.springframework.batch.item.support.SynchronizedItemStreamReader
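A possible explanation for that ClassCastException, assuming the reader bean is step-scoped or otherwise proxied: the JDK dynamic proxy (com.sun.proxy.$Proxy344) implements only the bean's interfaces, so it cannot be cast to the concrete SynchronizedItemStreamReader class. Referring to the bean by its interface type avoids the cast; a sketch:

@Bean
public Step step(ItemStreamReader<DataLifeCycleEvent> itemReader, ItemWriter<DataLifeCycleEvent> writer) {
    // Inject the scoped reader through its interface; the proxy is an ItemStreamReader,
    // but it is not an instance of the concrete SynchronizedItemStreamReader class.
    return stepBuilderFactory.get("step")
            .<DataLifeCycleEvent, DataLifeCycleEvent>chunk(100)
            .reader(itemReader)
            .writer(writer)
            .taskExecutor(new SimpleAsyncTaskExecutor())
            .build();
}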

Partitioned Spring Batch Step repeats the same successful slave StepExecutions

Using Spring Batch 3.0.4.RELEASE.
I configure a job to use a partition step. The slave step uses chunk size 1. There are six threads in the task executor. I run this test with various grid sizes from six to hundreds. My grid size is the number of slave StepExecutions I expect == the number of ExecutionContexts created by my partitioner.
The result is always this:
The six threads pick up six different step executions and execute them successfully. Then the same six step executions run again and again in the same thread!
I notice that there is a loop in RepeatTemplate.executeInternal(...) that never ends. It keeps executing the same StepExecution just incrementing the version.
Here's the Java configuration code:
@Bean
@StepScope
public RapRequestItemReader rapReader(
        @Value("#{stepExecutionContext['" + RapJobConfig.LIST_OF_IDS_STEP_EXECUTION_CONTEXT_VAR + "']}") String listOfIds,
        final @Value("#{stepExecutionContext['" + RapJobConfig.TIME_STEP_EXECUTION_CONTEXT_VAR + "']}") String timeString) {
    final List<Asset> farms = Arrays.asList(listOfIds.split(",")).stream().map(intString -> assetDao.getById(Integer.valueOf(intString)))
            .collect(Collectors.toList());
    return new RapRequestItemReader(timeString, farms);
}

@Bean
public ItemProcessor<RapRequest, PullSuccess> rapProcessor() {
    return rapRequest -> {
        return rapPull.pull(rapRequest.timestamp, rapRequest.farms);
    };
}

@Bean
public TaskletStep rapStep1(StepBuilderFactory stepBuilderFactory, RapRequestItemReader rapReader) {
    return stepBuilderFactory.get(RAP_STEP_NAME)
            .<RapRequest, PullSuccess>chunk(RAP_STEP_CHUNK_SIZE)
            .reader(rapReader)
            .processor(rapProcessor())
            .writer(updateCoverageWriter)
            .build();
}

private RapFilePartitioner createRapFilePartitioner(RapParameter rapParameter) {
    RapFilePartitioner partitioner = new RapFilePartitioner(rapParameter, rapPull.getIncrementHours());
    return partitioner;
}

@Bean
public ThreadPoolTaskExecutor pullExecutor() {
    ThreadPoolTaskExecutor pullExecutor = new ThreadPoolTaskExecutor();
    pullExecutor.setCorePoolSize(weatherConfig.getNumberOfThreadsPerModelType());
    pullExecutor.setMaxPoolSize(weatherConfig.getNumberOfThreadsPerModelType());
    pullExecutor.setAllowCoreThreadTimeOut(true);
    return pullExecutor;
}

@Bean
@JobScope
public Step rapPartitionByTimestampStep(StepBuilderFactory stepBuilderFactory, @Value("#{jobParameters['config']}") String config,
        TaskletStep rapStep1) {
    RapParameter rapParameter = GsonHelper.fromJson(config, RapParameter.class);
    int gridSize = calculateGridSize(rapParameter);
    return stepBuilderFactory.get("rapPartitionByTimestampStep")
            .partitioner(rapStep1)
            .partitioner(RAP_STEP_NAME, createRapFilePartitioner(rapParameter))
            .taskExecutor(pullExecutor())
            .gridSize(gridSize)
            .build();
}

@Bean
public Job rapJob(JobBuilderFactory jobBuilderFactory, Step rapPartitionByTimestampStep) {
    return jobBuilderFactory.get(JOB_NAME)
            .start(rapPartitionByTimestampStep)
            .build();
}
Though it's hard to tell this from the question, the problem was in the reader. The ItemReader was never returning null.
In the design, a StepExecution was supposed to process only one item. However, after processing that item, the ItemReader was returning that same item again instead of returning null.
I fixed it by having the ItemReader return null the second time read is called.
A better design might be to use a TaskletStep instead of a ChunkStep.
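A minimal sketch of that fix, assuming a reader that is supposed to hand out exactly one RapRequest per step execution (the single-item field and constructor here are illustrative, not the original RapRequestItemReader):

public class RapRequestItemReader implements ItemReader<RapRequest> {

    private final RapRequest item;
    private boolean read = false;

    public RapRequestItemReader(RapRequest item) {
        this.item = item;
    }

    @Override
    public RapRequest read() {
        if (read) {
            return null; // end of data: without this, the chunk loop keeps re-reading the same item
        }
        read = true;
        return item;
    }
}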
