First of all, thank you for visiting my post.
I have a Spring Batch Project that is using tasklet and chunk.
The chunk-oriented step, aka step1, is where I process the data and generate a new CSV file in the S3 bucket.
The tasklet, aka step2, is where I read the CSV file from the S3 bucket that was generated in step1 and publish to an SNS topic.
Right now, I have a problem with the SNSTopicSender class that implements Tasklet.
Here are the config class and the SNSTopicSender class.
@Bean
Step step1() {
    return this.stepBuilderFactory
        .get("step1")
        .<BadStudent, BadStudent>chunk(100)
        .reader(new IteratorItemReader<Student>(this.StudentLoader.Students.iterator()) as ItemReader<? extends BadStudent>)
        .processor(this.studentProcessor as ItemProcessor<? super BadStudent, ? extends BadStudent>)
        .writer(this.csvWriter())
        .build()
}

@Bean
Step step2() {
    return this.stepBuilderFactory
        .get("step2")
        .tasklet(new PublishSnsTopic())
        .build()
}

@Bean
Job job() {
    return this.jobBuilderFactory
        .get("scoring-students-batch")
        .incrementer(new RunIdIncrementer())
        .start(this.step1())
        .next(this.step2())
        .build()
}
@Configuration
@Service
class SNSTopicSender implements Tasklet {

    @Autowired
    ResourceLoader resourceLoader

    List<BadStudent> badStudents

    @Autowired
    FileProperties fileProperties

    @PostConstruct
    void setup() {
        String badStudentCSVFileName = "s3://students/failedStudents.csv"
        Reader badStudentReader = new InputStreamReader(
            this.resourceLoader.getResource(badStudentCSVFileName).inputStream
        )
        // create message body and call the publishTopic function
    }

    void publishTopic(SnsClient snsClient, String message, String arn) {
        // sending a topic
    }

    @Override
    RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        return RepeatStatus.FINISHED
    }
}
I used @PostConstruct because without it, resourceLoader and fileProperties would be null, since the fields would not have been injected yet, as you know.
To work around that, I used @PostConstruct.
However, I recently realized that the SNSTopicSender class is not reading the CSV file that was just created; it is reading the CSV file that was already there before this batch job ran, because @PostConstruct fires before step1 (where the target CSV file is created) has completed.
But if I remove @PostConstruct, then resourceLoader and fileProperties are null, meaning the batch job would not know where the CSV file is stored or how to read it.
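What I think I need is to do the reading inside execute() instead, since that runs after step1 has finished and after the beans are injected. A rough, untested sketch of what I mean (using the same class and field names as above), though I am not sure this is the right approach:

public class SNSTopicSender implements Tasklet {

    @Autowired
    ResourceLoader resourceLoader;

    @Autowired
    FileProperties fileProperties;

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        // step1 has already completed when this runs, so the freshly generated file should be visible,
        // and dependency injection has also happened by now, so resourceLoader is not null
        String badStudentCSVFileName = "s3://students/failedStudents.csv";
        Reader badStudentReader = new InputStreamReader(
                resourceLoader.getResource(badStudentCSVFileName).getInputStream());
        // build the message body from the CSV and call publishTopic(...)
        return RepeatStatus.FINISHED;
    }
}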
Can anyone help me with this please?
I have a job/task to read the sub-folders/directories of a given folder/path. The path is dynamic; we get it from a Controller. Currently, I have used Tasklets: there are 3 tasklets, one to read the sub-directories, another to process them into objects to save to the DB, and a last one to write the processed data objects to the database.
The folders can have any number of sub-folders. Currently, I have used this code:
Path start = Paths.get("x:\\data\\");
Stream<Path> stream = Files.walk(start, 1);
List<String> collect = stream
.map(String::valueOf)
.sorted()
.collect(Collectors.toList());
To read all the sub-folders at once.
I followed this https://www.baeldung.com/spring-batch-tasklet-chunk example of a Tasklet implementation for the purpose. Is this the right approach? I also need to run the job asynchronously with multi-threading.
As there can be a huge number of sub-folders, there can be a huge number of rows or lists of data to process and write to the database.
Please suggest an appropriate approach.
I am learning Spring Batch, have done a few examples on file read/process/write too, and used the chunk approach for those.
But my job is to read the sub-directories of a folder/path, so I cannot decide which approach to follow.
I have a similar scenario: I need to read all the files from a folder, process them, and write to the DB (Doc).
@Configuration
@EnableBatchProcessing
public class BatchConfig {

    @Bean
    public Job job(JobBuilderFactory jobBuilderFactory, Step masterStep) {
        return jobBuilderFactory.get("MainJob")
                .incrementer(new RunIdIncrementer())
                .flow(masterStep)
                .end()
                .build();
    }

    @Bean
    public Step mainStep(StepBuilderFactory stepBuilderFactory,
                         JdbcBatchItemWriter<Transaction> writer,
                         ItemReader<String> reader,
                         TransactionItemProcessor processor) {
        return stepBuilderFactory.get("Main")
                .<String, Transaction>chunk(2)
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .taskExecutor(jobTaskExecutor())
                .listener(new ItemReaderListener())
                .build();
    }

    @Bean
    public TaskExecutor jobTaskExecutor() {
        ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
        taskExecutor.setCorePoolSize(2);
        taskExecutor.setMaxPoolSize(10);
        taskExecutor.afterPropertiesSet();
        return taskExecutor;
    }

    @Bean
    @StepScope
    public ItemReader<String> reader(@Value("#{stepExecution}") StepExecution stepExecution) throws IOException {
        Path start = Paths.get("D:\\test");
        List<String> inputFile = Files.walk(start, 1)
                .map(String::valueOf)
                .sorted()
                .collect(Collectors.toList());
        return new IteratorItemReader<>(inputFile);
    }

    @Bean
    @StepScope
    public TransactionItemProcessor processor(@Value("#{stepExecution}") StepExecution stepExecution) {
        return new TransactionItemProcessor();
    }

    @Bean
    @StepScope
    public JdbcBatchItemWriter<Transaction> writer(DataSource dataSource) {
        return new JdbcBatchItemWriterBuilder<Transaction>()
                .itemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider<>())
                .sql("INSERT INTO transaction (id, date, type) VALUES (:id, :date, :type)")
                .dataSource(dataSource)
                .build();
    }
}
I am pretty new to spring technology. I am trying to build an ETL like app using spring batch with spring boot.
Able to run the basic job (read->process->write). Now, I want to read the arguments (like date, file name, type, etc) from a config file (later) or command line (can work with it now) and use them in my job.
Entry point:
// Imports
@SpringBootApplication
@EnableBatchProcessing
public class EtlSpringBatchApplication {

    public static void main(String[] args) {
        SpringApplication.run(EtlSpringBatchApplication.class, args);
    }
}
My batch configuration
// BatchConfig.java
// Imports
@Autowired
public JobBuilderFactory jobBuilderFactory;

@Autowired
public StepBuilderFactory stepBuilderFactory;

@Autowired
public MyDao myDao;

@Bean
public Job job() {
    return jobBuilderFactory
            .get("job")
            .incrementer(new RunIdIncrementer())
            .listener(new Listener(myDao))
            .flow(step1())
            .end()
            .build();
}

@Bean
public Step step1() {
    return stepBuilderFactory.get("step1").<MyModel, MyModel>chunk(1000)
            .reader(Reader.reader("my_file_20200520.txt"))
            .processor(new Processor())
            .writer(new Writer(myDao))
            .build();
}
I have the basic steps in place.
Reader.java has a method to read a flat file.
public static FlatFileItemReader<MyModel> reader(String path) {......}
Processor.java has the process method defined. I added a @BeforeStep method to fetch some details from the DB required for processing.
public class Processor implements ItemProcessor<MyModel, MyModel> {

    private static final Logger log = LoggerFactory.getLogger(Processor.class);

    private Long id = null;

    @BeforeStep
    public void getId(StepExecution stepExecution) {
        this.id = stepExecution.getJobExecution().getExecutionContext().getLong("Id");
    }

    @Override
    public MyModel process(MyModel myModel) throws Exception {
    }
}
Writer.java implements ItemWriter and contains the write code.
Listener.java extends JobExecutionListenerSupport and overrides afterJob and beforeJob.
Basically, I tried to use the execution context here in beforeJob:
@Override
public void beforeJob(JobExecution jobExecution) {
    log.info("Getting the id..");
    this.id = myDao.getLatestId();
    log.info("id retrieved is: " + this.id);
    jobExecution.getExecutionContext().putLong("Id", this.id);
}
Now, what I am looking for is:
The reader should get the file name from the job arguments, i.e., when I run the job, I should be able to pass some arguments, one of them being the file path.
Later some methods (like get id, etc) require few more variables which can be passed as arguments to job i.e. run_date, type, etc.
In short I am looking for a way to,
Pass job arguments to my app (run_date, type, file path etc)
Use them in reader and other places (Listener, Writer)
Can someone tell me what additions I should make in my BatchConfig.java and other places to read the job parameters (from the command line or a config file, whichever is easier)?
Both Spring Batch and Spring Boot reference documentation show how to pass parameters to a job:
Running Jobs from the Command Line
Running Spring Batch jobs from the Command Line
Moreover, Spring Batch docs explain in details and with code examples how to use those parameters in batch components (like reader, writer, etc):
Late Binding of Job and Step Attributes
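As a quick illustration of late binding, a minimal sketch reusing the Reader.reader(path) factory from the question (the job parameter name filePath is just an example):

@Bean
@StepScope
public FlatFileItemReader<MyModel> reader(@Value("#{jobParameters['filePath']}") String filePath) {
    // Reader.reader(...) is the existing static factory method from the question;
    // Spring resolves the jobParameters expression when the step starts
    return Reader.reader(filePath);
}

With Spring Boot, the parameter can then be passed on the command line as a key=value argument, as described in the docs linked above, for example: java -jar etl.jar filePath=my_file_20200520.txt run_date=2020-05-20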
You can read the value of the job parameters set from the config file inside the reader or other classes within the Spring Batch execution context. Below is a snippet for reference.
The application.yml file can have the below config:
batch.configs.filePath: c:\test
You can add the filePath read from the config to your job parameters when you start the job. Snippet of the class:
// Job and Job Launcher related autowires..

@Value("${batch.configs.filePath}")
private String filePath;

// inside a method block,
JobParameters jobParameters = new JobParametersBuilder().addLong("JobID", System.currentTimeMillis())
        .addString("filePath", filePath).toJobParameters();

try {
    jobLauncher.run(batchJob, jobParameters);
} catch (Exception e) {
    logger.error("Exception while running a batch job {}", e.getMessage());
}
One way to access the job parameters is to implement StepExecutionListener in your reader class and make use of its overridden methods beforeStep and afterStep. Similar implementations can be done in other classes as well:
public class Reader implements ItemReader<String>, StepExecutionListener {

    private static final Logger logger = LoggerFactory.getLogger(Reader.class);

    private String filePath;

    @Override
    public void beforeStep(StepExecution stepExecution) {
        try {
            filePath = stepExecution.getJobParameters().getString("filePath");
        } catch (Exception e) {
            logger.error("Exception while performing read {}", e);
        }
    }

    @Override
    public String read() throws Exception {
        // the filePath value read from the job parameters can be used inside the read implementation
        return null; // placeholder for the actual read logic
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        return ExitStatus.COMPLETED;
    }
}
I configured a Spring Batch job to skip a bad record when there is an error reading the XML file. The SkipPolicy implementation always returns true in order to skip the bad record.
The job needs to continue processing the rest of the records; however, in my case it stops after the bad record and is marked as completed.
@Configuration
@Import(DataSourceConfig.class)
@EnableWebMvc
@ComponentScan(basePackages = "org.nova.batch")
@EnableBatchProcessing
public class BatchIssueConfiguration {

    private static final Logger LOG = LoggerFactory.getLogger(BatchIssueConfiguration.class);

    @Autowired
    private JobBuilderFactory jobBuilderFactory;

    @Autowired
    private StepBuilderFactory stepBuilderFactory;

    @Bean(name = "jobRepository")
    public JobRepository jobRepository(DataSource dataSource, PlatformTransactionManager transactionManager) throws Exception {
        JobRepositoryFactoryBean factory = new JobRepositoryFactoryBean();
        factory.setDatabaseType("derby");
        factory.setDataSource(dataSource);
        factory.setTransactionManager(transactionManager);
        return factory.getObject();
    }

    @Bean
    public Step stepSGR() throws IOException {
        return stepBuilderFactory.get("ETL_STEP").<SigmodRecord.Issue, SigmodRecord.Issue>chunk(1)
                //.processor(itemProcessor())
                .writer(itemWriter())
                .reader(multiReader())
                .faultTolerant()
                .skipLimit(Integer.MAX_VALUE)
                .skipPolicy(new FileVerificationSkipper())
                .skip(Throwable.class)
                .build();
    }

    @Bean
    public SkipPolicy fileVerificationSkipper() {
        return new FileVerificationSkipper();
    }

    @Bean
    @JobScope
    public MultiResourceItemReader<SigmodRecord.Issue> multiReader() throws IOException {
        MultiResourceItemReader<SigmodRecord.Issue> mrir = new MultiResourceItemReader<SigmodRecord.Issue>();
        //FileSystemResource [] files = new FileSystemResource [{}];
        ResourcePatternResolver rpr = new PathMatchingResourcePatternResolver();
        Resource[] resources = rpr.getResources("file:c:/temp/Sigm*.xml");
        mrir.setResources(resources);
        mrir.setDelegate(xmlItemReader());
        return mrir;
    }
}
public class FileVerificationSkipper implements SkipPolicy {

    private static final Logger LOG = LoggerFactory.getLogger(FileVerificationSkipper.class);

    @Override
    public boolean shouldSkip(Throwable t, int skipCount) throws SkipLimitExceededException {
        LOG.error("There is an error {}", t);
        return true;
    }
}
The file has input that includes "&", which causes the reading error, i.e.:
<title>Notes of DDTS & n Apparatus for Experimental Research</title>
which throws the following error:
org.springframework.dao.DataAccessResourceFailureException: Error reading XML stream; nested exception is javax.xml.stream.XMLStreamException: ParseError at [row,col]:[127,25]
Message: The entity name must immediately follow the '&' in the entity reference.
Is there anything I'm doing wrong in my configuration that does not allow the rest of the records to continue processing?
To skip certain types of exceptions, you can either specify a skip policy in which you write custom logic for deciding whether to skip an exception, like the code below:
@Bean
public Step stepSGR() throws IOException {
    return stepBuilderFactory.get("ETL_STEP").<SigmodRecord.Issue, SigmodRecord.Issue>chunk(1)
            //.processor(itemProcessor())
            .writer(itemWriter())
            .reader(multiReader())
            .faultTolerant()
            .skipPolicy(new FileVerificationSkipper())
            .build();
}

public class FileVerificationSkipper implements SkipPolicy {

    private static final Logger LOG = LoggerFactory.getLogger(FileVerificationSkipper.class);

    @Override
    public boolean shouldSkip(Throwable t, int skipCount) throws SkipLimitExceededException {
        LOG.error("There is an error {}", t);
        if (t instanceof DataAccessResourceFailureException) {
            return true;
        }
        return false;
    }
}
Or you can simply set it up like below:
@Bean
public Step stepSGR() throws IOException {
    return stepBuilderFactory.get("ETL_STEP").<SigmodRecord.Issue, SigmodRecord.Issue>chunk(1)
            //.processor(itemProcessor())
            .writer(itemWriter())
            .reader(multiReader())
            .faultTolerant()
            .skipLimit(Integer.MAX_VALUE)
            .skip(DataAccessResourceFailureException.class)
            .build();
}
This issue falls under malformed XML, and it seems there is no way to recover from it except by fixing the XML itself. Spring's StaxEventItemReader uses XMLEventReader for its low-level parsing of the XML, so I tried reading the XML file with XMLEventReader directly to try to skip the bad block. However, XMLEventReader.nextEvent() kept throwing an exception at the bad block. I tried to handle that in a try/catch block in order to skip to the next event, but the reader won't move past it. So for now the only way to solve the issue is to fix the XML itself before processing it.
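For reference, this is roughly what I tried (simplified; the concrete file name is just an example), and the catch block never helps because nextEvent() keeps failing at the same position:

XMLInputFactory factory = XMLInputFactory.newInstance();
XMLEventReader eventReader = factory.createXMLEventReader(new FileInputStream("c:/temp/SigmodRecord.xml"));
while (eventReader.hasNext()) {
    try {
        XMLEvent event = eventReader.nextEvent();
        // inspect/process the event here
    } catch (XMLStreamException e) {
        // intended to skip the bad block and continue,
        // but the reader does not advance past the malformed entity
    }
}

So the practical fix was to repair the source file itself, e.g. escaping the bare & as &amp; before running the job.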
I am using Spring boot 2.0.5.RELEASE and running a batch process using this:
# prevent auto-start of batch jobs
spring:
  batch:
    job:
      enabled: false
and triggering it manually using a controller endpoint (the input contains the parameters collected from the user via the controller):
jobLauncher.run(job, new JobParametersBuilder()
.addDate("date", new Date())
.addJobParameters(new JobParameters(input)).toJobParameters());
Here is my batch configuration:
@Bean
public MongoItemReader<Document> reader() {
    MongoItemReader<Document> reader = new MongoItemReader<>();
    reader.setTemplate(mongoTemplate);
    reader.setCollection(XML_PERSIST_COLLECTION);
    reader.setQuery("{}");
    Map<String, Sort.Direction> sorts = new HashMap<>(1);
    sorts.put("status", Sort.Direction.ASC);
    reader.setSort(sorts);
    reader.setTargetType(Document.class);
    return reader;
}

@Bean
@StepScope
public MyItemProcessor processor() {
    return new MyItemProcessor();
}

@Bean
public MongoItemWriter<OutputDto> writer() {
    MongoItemWriter<OutputDto> writer = new MongoItemWriter<>();
    writer.setTemplate(mongoTemplate);
    writer.setCollection(RESPONSE_COLLECTION);
    return writer;
}
@Bean
public Step step() {
    return stepBuilderFactory.get("step")
            .<Document, OutputDto>chunk(1)
            .reader(reader())
            .processor(processor())
            .writer(writer())
            .allowStartIfComplete(true)
            .build();
}

@Bean
public Job job(Step step) {
    return jobBuilderFactory.get("job")
            .incrementer(new RunIdIncrementer())
            .flow(step)
            .end()
            .build();
}
and my processor:
public class MyItemProcessor implements ItemProcessor<Document, OutputDto> {

    @Value("#{jobParameters['username']}")
    private String username;

    @Value("#{jobParameters['password']}")
    private String password;

    @Override
    public OutputDto process(final Document document) throws Exception {
        // implementation code
    }
}
I am using @StepScope for the processor to extract the job parameters that are passed from my controller.
Issue:
Everything is fine except that the batch job runs only once after the app starts and will not run again (the job launches, but I set debug points in the processor and execution never reaches them). I am already adding a timestamp job parameter so that the batch job can be run again, yet the processor does not run more than once (when it should). Any ideas?
The reader() and writer() had singleton scope while the processor() had @StepScope, so it looks like that's why the writer() was not getting invoked.
I added @StepScope to the reader and writer and now everything is working fine, though it didn't strike me as intuitive; it should have worked without that in 2.0.5.RELEASE.
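For reference, the reader and writer beans from the question now look like this, with only the @StepScope annotation added:

@Bean
@StepScope
public MongoItemReader<Document> reader() {
    MongoItemReader<Document> reader = new MongoItemReader<>();
    reader.setTemplate(mongoTemplate);
    reader.setCollection(XML_PERSIST_COLLECTION);
    reader.setQuery("{}");
    Map<String, Sort.Direction> sorts = new HashMap<>(1);
    sorts.put("status", Sort.Direction.ASC);
    reader.setSort(sorts);
    reader.setTargetType(Document.class);
    return reader;
}

@Bean
@StepScope
public MongoItemWriter<OutputDto> writer() {
    MongoItemWriter<OutputDto> writer = new MongoItemWriter<>();
    writer.setTemplate(mongoTemplate);
    writer.setCollection(RESPONSE_COLLECTION);
    return writer;
}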
Introduction
I am trying to use job parameters created in a tasklet to create the steps that follow the execution of that tasklet.
A tasklet tries to find some files (findFiles()), and if it finds any, it saves the filenames to a list of strings.
In the tasklet I pass the data as following:
chunkContext.getStepContext().getStepExecution().getExecutionContext().put("files", fileNames);
The next step is a parallel flow where for each file a simple reader-processor-writer step will be executed (if you are interested in how I got there please see my previous question: Spring Batch - Looping a reader/processor/writer step)
When the job readFilesJob() is built, the flow is initially created using a "fake" list of files, because the real list of files is only known after the tasklet has been executed.
Question
How do I configure my job so the tasklet gets executed first and then the parallel flow gets executed using the list of files generated from the tasklet?
I think it comes down to getting the list of filenames loaded with the correct data at the correct moment during runtime... but how?
Reproduce
Here is my simplified configuration:
@Configuration
@EnableBatchProcessing
public class BatchConfiguration {

    private static final String FLOW_NAME = "flow1";
    private static final String PLACE_HOLDER = "empty";

    @Autowired
    public JobBuilderFactory jobBuilderFactory;

    @Autowired
    public StepBuilderFactory stepBuilderFactory;

    public List<String> files = Arrays.asList(PLACE_HOLDER);

    @Bean
    public Job readFilesJob() throws Exception {
        List<Step> steps = files.stream().map(file -> createStep(file)).collect(Collectors.toList());

        FlowBuilder<Flow> flowBuilder = new FlowBuilder<>(FLOW_NAME);
        Flow flow = flowBuilder
                .start(findFiles())
                .next(createParallelFlow(steps))
                .build();

        return jobBuilderFactory.get("readFilesJob")
                .start(flow)
                .end()
                .build();
    }

    private static Flow createParallelFlow(List<Step> steps) {
        SimpleAsyncTaskExecutor taskExecutor = new SimpleAsyncTaskExecutor();
        taskExecutor.setConcurrencyLimit(steps.size());

        List<Flow> flows = steps.stream()
                .map(step -> new FlowBuilder<Flow>("flow_" + step.getName())
                        .start(step)
                        .build())
                .collect(Collectors.toList());

        return new FlowBuilder<SimpleFlow>("parallelStepsFlow").split(taskExecutor)
                .add(flows.toArray(new Flow[flows.size()]))
                .build();
    }

    private Step createStep(String fileName) {
        return stepBuilderFactory.get("readFile" + fileName)
                .chunk(100)
                .reader(reader(fileName))
                .writer(writer(fileName))
                .build();
    }

    private FileFinder findFiles() {
        return new FileFinder();
    }
}
Research
The question and answer from How to safely pass params from Tasklet to step when running parallel jobs suggest the usage of a construct like this in the reader/writer:
@Value("#{jobExecutionContext[filePath]}") String filePath
However, I really hope it is possible to pass the fileName as a string to the reader/writer, because of the way the steps are created in the createParallelFlow() method. Therefore, even though the answer to that question might be a solution to my problem here, it is not the desired solution. But please do not refrain from correcting me if I am wrong.
Closing
I am using the file names example to clarify the problem better. My problem is not actually the reading of multiple files from a directory. My question really boils down to the idea of generating data during runtime and passing it to the next dynamically generated step(s).
EDIT:
Added a simplified version of the FileFinder tasklet.
@Component
public class FileFinder implements Tasklet, InitializingBean {

    List<String> fileNames = new ArrayList<>();

    public List<String> getFileNames() {
        return fileNames;
    }

    @PostConstruct
    public void afterPropertiesSet() {
        // read the filenames and store them in the list
        fileNames.add("sample-data1.csv");
        fileNames.add("sample-data2.csv");
    }

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        // Execution of methods that will find the file names and put them in the list...
        chunkContext.getStepContext().getStepExecution().getExecutionContext().put("files", fileNames);
        return RepeatStatus.FINISHED;
    }
}
I'm not sure if I understood your problem correctly, but as far as I can see, you need to have the list of filenames before you build your job dynamically.
You could do it like this:
@Component
public class MyJobSetup {

    List<String> fileNames;

    public List<String> getFileNames() {
        return fileNames;
    }

    @PostConstruct
    public void afterPropertiesSet() {
        // read the filenames and store them in the list
        fileNames = ....;
    }
}
After that, you can inject this bean into your JobConfiguration bean:
@Configuration
@EnableBatchProcessing
@Import(MyJobSetup.class)
public class BatchConfiguration {

    private static final String FLOW_NAME = "flow1";
    private static final String PLACE_HOLDER = "empty";

    @Autowired
    private MyJobSetup jobSetup; // <--- Inject
    // PostConstruct of MyJobSetup was executed when it is injected

    @Autowired
    public JobBuilderFactory jobBuilderFactory;

    @Autowired
    public StepBuilderFactory stepBuilderFactory;

    public List<String> files = Arrays.asList(PLACE_HOLDER);

    @Bean
    public Job readFilesJob() throws Exception {
        List<Step> steps = jobSetup.getFileNames()   // get the list of files
                .stream()                            // as stream
                .map(file -> createStep(file))       // map...
                .collect(Collectors.toList());       // and create the list of steps