I am using Spring Boot 2.0.5.RELEASE and disabling auto-start of batch jobs with this configuration:
# prevent auto-start of batch jobs
spring:
  batch:
    job:
      enabled: false
and triggering the job manually from a controller endpoint (input holds the parameters collected from the user in the controller):
jobLauncher.run(job, new JobParametersBuilder()
.addDate("date", new Date())
.addJobParameters(new JobParameters(input)).toJobParameters());
Here is my batch configuration:
@Bean
public MongoItemReader<Document> reader() {
    MongoItemReader<Document> reader = new MongoItemReader<>();
    reader.setTemplate(mongoTemplate);
    reader.setCollection(XML_PERSIST_COLLECTION);
    reader.setQuery("{}");
    Map<String, Sort.Direction> sorts = new HashMap<>(1);
    sorts.put("status", Sort.Direction.ASC);
    reader.setSort(sorts);
    reader.setTargetType(Document.class);
    return reader;
}

@Bean
@StepScope
public MyItemProcessor processor() {
    return new MyItemProcessor();
}

@Bean
public MongoItemWriter<OutputDto> writer() {
    MongoItemWriter<OutputDto> writer = new MongoItemWriter<>();
    writer.setTemplate(mongoTemplate);
    writer.setCollection(RESPONSE_COLLECTION);
    return writer;
}

@Bean
public Step step() {
    return stepBuilderFactory.get("step")
            .<Document, OutputDto>chunk(1)
            .reader(reader())
            .processor(processor())
            .writer(writer())
            .allowStartIfComplete(true)
            .build();
}

@Bean
public Job job(Step step) {
    return jobBuilderFactory.get("job")
            .incrementer(new RunIdIncrementer())
            .flow(step)
            .end()
            .build();
}
and my processor:
public class MyItemProcessor implements ItemProcessor<Document, OutputDto> {

    @Value("#{jobParameters['username']}")
    private String username;

    @Value("#{jobParameters['password']}")
    private String password;

    @Override
    public OutputDto process(final Document document) throws Exception {
        // implementation code
    }
}
I am using @StepScope for the processor to extract the job parameters that are passed from my controller.
Issue:
Everything works, except that the batch job runs only once after the app starts and never again (the job is launched, but breakpoints I set in the processor are never hit on later runs). I am already adding a timestamp job parameter so the job can be re-run, yet the processor does not execute more than once (when it should). Any ideas?
The reader() and writer() had singleton scope while the processor() had @StepScope, so it looks like that is why the writer() was not getting invoked.
I added @StepScope to the reader and writer and now everything is working fine, though it did not strike me as intuitive: it should have worked without that in 2.0.5.RELEASE.
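For reference, a minimal sketch of the reader and writer beans with step scope added (same bodies as above; the @StepScope annotation is the only change):

@Bean
@StepScope
public MongoItemReader<Document> reader() {
    MongoItemReader<Document> reader = new MongoItemReader<>();
    reader.setTemplate(mongoTemplate);
    reader.setCollection(XML_PERSIST_COLLECTION);
    reader.setQuery("{}");
    Map<String, Sort.Direction> sorts = new HashMap<>(1);
    sorts.put("status", Sort.Direction.ASC);
    reader.setSort(sorts);
    reader.setTargetType(Document.class);
    return reader;
}

@Bean
@StepScope
public MongoItemWriter<OutputDto> writer() {
    MongoItemWriter<OutputDto> writer = new MongoItemWriter<>();
    writer.setTemplate(mongoTemplate);
    writer.setCollection(RESPONSE_COLLECTION);
    return writer;
}

With step scope, a fresh reader and writer instance is created for every step execution, so state left over from the first run (for example the reader's exhausted page/cursor position) does not leak into later runs.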
Related
I have a simple job with only one step, but somehow Spring Batch loops from the reader to the processor and then back to the reader again, and I can't understand why.
This is the structure:
The reader runs two selects against the same database: the first select looks up records in a given state in the first table, and the second select matches those results against a second table and returns the records that are sent to the processor, which calls an API for every record.
I need the batch to stop at that point, i.e. after the processor, but I am having problems with this.
Example of my batch:
@Configuration
@EnableBatchProcessing
@EnableScheduling
public class LoadIdemOperationJob {

    @Autowired
    public JobBuilderFactory jobBuilderFactory;

    @Autowired
    public StepBuilderFactory stepBuilderFactory;

    @Autowired
    public JobLauncher jobLauncher;

    @Autowired
    public JobRegistry jobRegistry;

    @Scheduled(cron = "* */3 * * * *")
    public void perform() throws Exception {
        JobParameters jobParameters = new JobParametersBuilder()
                .addString("JobID", String.valueOf(System.currentTimeMillis()))
                .toJobParameters();
        jobLauncher.run(jobRegistry.getJob("firstJob"), jobParameters);
    }

    @Bean
    public Job firstJob(Step firstStep) {
        return jobBuilderFactory.get("firstJob")
                .start(firstStep)
                .build();
    }

    @Bean
    public Step firstStep(MyReader reader, MyProcessor processor) {
        return stepBuilderFactory.get("firstStep")
                .<List<String>, List<String>>chunk(1)
                .reader(reader)
                .processor(processor)
                .writer(new NoOpItemWriter())
                .build();
    }

    @Bean
    @StepScope
    public MyReader reader(@Value("${hours}") String hours) {
        return new MyReader(hours);
    }

    @Bean
    public MyProcessor processor() {
        return new MyProcessor();
    }

    public static class NoOpItemWriter implements ItemWriter<Object> {
        @Override
        public void write(@NonNull List<?> items) {
        }
    }

    @Bean
    public JobRegistryBeanPostProcessor jobRegistryBeanPostProcessor() {
        JobRegistryBeanPostProcessor postProcessor = new JobRegistryBeanPostProcessor();
        postProcessor.setJobRegistry(jobRegistry);
        return postProcessor;
    }

    @Bean
    public RequestContextListener requestContextListener() {
        return new RequestContextListener();
    }
}
Example of Reader:
public class MyReader implements ItemReader<List<String>> {

    public String hours;

    private List<String> results;

    @Autowired
    private JdbcTemplate jdbcTemplate;

    public MyReader(String hours) {
        this.hours = hours;
    }

    @Override
    public List<String> read() throws Exception {
        results = this.jdbcTemplate.queryForList(/* first query */, String.class);
        if (results.isEmpty()) {
            return null;
        }
        List<String> results = this.jdbcTemplate.queryForList(/* second query */, String.class);
        if (results.isEmpty()) {
            return null;
        }
        return results;
    }
}
And Processor:
public class MyProcessor implements ItemProcessor<List<String>, List<String>> {

    @Override
    public List<String> process(@NonNull List<String> results) throws Exception {
        results.forEach(result -> /* call the service */);
        return null;
    }
}
Thanks for help!
What you are seeing is the implementation of the chunk-oriented processing model of Spring Batch, where items are read and processed in sequence one by one, and written in chunks.
That said, the design and configuration of your chunk-oriented step is not ideal: the reader returns a List of Strings (so an item in your case is the List itself, not an element of the list), the processor loops over the elements of each List (which it is not intended to do), and there is no real item writer (a sign that either you don't need a chunk-oriented step, or the step is not well designed).
I recommend modifying your step design as follows (a sketch follows after this list):
The reader should return a single item, not a List. For example, keep an iterator over the results and have the reader return iterator.next().
Remove the processor and move its code into the item writer. The item processor is optional in a chunk-oriented step.
Create an item writer with the code of the item processor. Posting results to a REST endpoint is in fact a kind of write operation, so an item writer is definitely better suited than an item processor in this case.
With that design, you should see your chunk-oriented step read and write all items from your list without the impression that the job is "looping". That is simply the chunk-oriented pattern described above.
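For illustration, a minimal sketch of that redesign. The query strings and the API call are elided in the question, so they appear here as constructor parameters and a placeholder callService method:

public class MyReader implements ItemReader<String> {

    private final JdbcTemplate jdbcTemplate;
    private final String firstQuery;   // the two queries are elided in the question
    private final String secondQuery;
    private Iterator<String> iterator;

    public MyReader(JdbcTemplate jdbcTemplate, String firstQuery, String secondQuery) {
        this.jdbcTemplate = jdbcTemplate;
        this.firstQuery = firstQuery;
        this.secondQuery = secondQuery;
    }

    @Override
    public String read() {
        if (iterator == null) {
            // run both queries once, then hand out one record per read() call
            List<String> first = jdbcTemplate.queryForList(firstQuery, String.class);
            List<String> second = first.isEmpty()
                    ? Collections.<String>emptyList()
                    : jdbcTemplate.queryForList(secondQuery, String.class);
            iterator = second.iterator();
        }
        return iterator.hasNext() ? iterator.next() : null;
    }
}

public class MyWriter implements ItemWriter<String> {

    @Override
    public void write(List<? extends String> items) {
        // the former processor logic: call the API once per record
        items.forEach(this::callService);
    }

    private void callService(String item) {
        // elided in the question
    }
}

The step generics then become <String, String>, and the NoOpItemWriter and the processor bean are no longer needed.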
First of all, thank you for visiting my post.
I have a Spring Batch project that uses both a tasklet and a chunk-oriented step.
The chunk-oriented step (step1) is where I process the data and generate a new CSV file in the S3 bucket.
The tasklet (step2) reads the CSV file from the S3 bucket that was generated in step1 and publishes an SNS topic.
Right now, I have a problem with the SNSTopicSender class that implements Tasklet.
First of all, here are the config class and the SNSTopicSender class.
@Bean
Step step1() {
    return this.stepBuilderFactory
        .get("step1")
        .<BadStudent, BadStudent>chunk(100)
        .reader(new IteratorItemReader<Student>(this.StudentLoader.Students.iterator()) as ItemReader<? extends BadStudent>)
        .processor(this.studentProcessor as ItemProcessor<? super BadStudent, ? extends BadStudent>)
        .writer(this.csvWriter())
        .build()
}

@Bean
Step step2() {
    return this.stepBuilderFactory
        .get("step2")
        .tasklet(new PublishSnsTopic())
        .build()
}

@Bean
Job job() {
    return this.jobBuilderFactory
        .get("scoring-students-batch")
        .incrementer(new RunIdIncrementer())
        .start(this.step1())
        .next(this.step2())
        .build()
}

@Configuration
@Service
class SNSTopicSender implements Tasklet {

    @Autowired
    ResourceLoader resourceLoader

    List<BadStudent> badStudents

    @Autowired
    FileProperties fileProperties

    @PostConstruct
    void setup() {
        String badStudentCSVFileName = "s3://students/failedStudents.csv"
        Reader badStudentReader = new InputStreamReader(
            this.resourceLoader.getResource(badStudentCSVFileName).inputStream
        )
        // create the message body and call the publishTopic function
    }

    void publishTopic(SnsClient snsClient, String message, String arn) {
        // send the topic
    }

    @Override
    RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        return RepeatStatus.FINISHED
    }
}
I used @PostConstruct because without it resourceLoader and fileProperties were null, since the fields had not been injected yet, as you know.
So @PostConstruct was my workaround for that.
However, I recently realized that the SNSTopicSender class is not reading the CSV file that was just created, but the CSV file that was already there before this batch job ran, because @PostConstruct fires before step1 completes (and step1 is where the target CSV file is created).
But if I remove @PostConstruct, resourceLoader and fileProperties are null again, meaning this batch job would not know where the CSV file is stored nor how to read it.
Can anyone help me with this please?
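For context on the timing: @PostConstruct runs when the Spring container initializes the bean, before any step executes, while the tasklet's execute() method runs only when step2 starts, after step1 has finished and after the @Autowired fields have been injected. A minimal sketch (in Java, names taken from the snippet above, message-building elided) of a tasklet that reads the file inside execute() instead:

@Component
public class SNSTopicSender implements Tasklet {

    @Autowired
    private ResourceLoader resourceLoader;

    @Autowired
    private FileProperties fileProperties;

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        // runs after step1 has completed, so the freshly generated file is visible here,
        // and the @Autowired fields are already populated
        String badStudentCSVFileName = "s3://students/failedStudents.csv";
        try (Reader badStudentReader = new InputStreamReader(
                resourceLoader.getResource(badStudentCSVFileName).getInputStream())) {
            // build the message body and call publishTopic(...) as before
        }
        return RepeatStatus.FINISHED;
    }
}

Note that for the fields to be injected at all, the tasklet has to be a Spring-managed bean wired into the step definition, rather than created with new inside the step bean.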
I have a job/task to read the sub-folders/directories of a given folder/path. The path is dynamic; we get it from a controller. Currently I have used tasklets: there are 3 of them, one to read the sub-directories, one to process them and prepare the objects to save to the DB, and a last one to write the processed data objects to the database.
The folders can have any number of sub-folders. Currently, I use this code:
Path start = Paths.get("x:\\data\\");
Stream<Path> stream = Files.walk(start, 1);
List<String> collect = stream
.map(String::valueOf)
.sorted()
.collect(Collectors.toList());
to read all the sub-folders at once.
I followed this https://www.baeldung.com/spring-batch-tasklet-chunk example of a tasklet implementation for the purpose. Is this the right approach? I also need to run the job asynchronously with multi-threading.
Since there can be a huge number of sub-folders, there can be a huge number of rows or lists of data to process and write to the database.
Please suggest an appropriate approach.
I am learning Spring Batch and have done a few examples of file read/process/write, using the chunk approach for those.
But my job is to read the sub-directories of a folder/path, so I cannot decide which approach to follow.
I have a similar scenario: I need to read all the files from a folder, process them, and write them to the DB (see the docs on the chunk model).
@Configuration
@EnableBatchProcessing
public class BatchConfig {

    @Bean
    public Job job(JobBuilderFactory jobBuilderFactory, Step masterStep) {
        return jobBuilderFactory.get("MainJob")
                .incrementer(new RunIdIncrementer())
                .flow(masterStep)
                .end()
                .build();
    }

    @Bean
    public Step mainStep(StepBuilderFactory stepBuilderFactory,
                         JdbcBatchItemWriter<Transaction> writer,
                         ItemReader<String> reader,
                         TransactionItemProcessor processor) {
        return stepBuilderFactory.get("Main")
                .<String, Transaction>chunk(2)
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .taskExecutor(jobTaskExecutor())
                .listener(new ItemReaderListener())
                .build();
    }

    @Bean
    public TaskExecutor jobTaskExecutor() {
        ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
        taskExecutor.setCorePoolSize(2);
        taskExecutor.setMaxPoolSize(10);
        taskExecutor.afterPropertiesSet();
        return taskExecutor;
    }

    @Bean
    @StepScope
    public ItemReader<String> reader(@Value("#{stepExecution}") StepExecution stepExecution) throws IOException {
        Path start = Paths.get("D:\\test");
        List<String> inputFile = Files.walk(start, 1)
                .map(String::valueOf)
                .sorted()
                .collect(Collectors.toList());
        return new IteratorItemReader<>(inputFile);
    }

    @Bean
    @StepScope
    public TransactionItemProcessor processor(@Value("#{stepExecution}") StepExecution stepExecution) {
        return new TransactionItemProcessor();
    }

    @Bean
    @StepScope
    public JdbcBatchItemWriter<Transaction> writer(DataSource dataSource) {
        return new JdbcBatchItemWriterBuilder<Transaction>()
                .itemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider<>())
                .sql("INSERT INTO transaction (id, date, type) VALUES (:id, :date, :type)")
                .dataSource(dataSource)
                .build();
    }
}
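The taskExecutor(...) on the step makes the chunk processing itself multi-threaded. If the job also needs to be launched asynchronously (so the calling controller returns immediately), a common approach is to give the JobLauncher its own asynchronous TaskExecutor; a minimal sketch, assuming the default SimpleJobLauncher:

@Bean
public JobLauncher asyncJobLauncher(JobRepository jobRepository) throws Exception {
    SimpleJobLauncher jobLauncher = new SimpleJobLauncher();
    jobLauncher.setJobRepository(jobRepository);
    // launch jobs on a separate thread so the caller is not blocked until the job finishes
    jobLauncher.setTaskExecutor(new SimpleAsyncTaskExecutor());
    jobLauncher.afterPropertiesSet();
    return jobLauncher;
}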
I configured a Spring Batch job to skip a bad record when there is an error reading the XML file. The SkipPolicy implementation always returns true in order to skip the bad record.
The job needs to continue processing the rest of the records; however, in my case it stops right after the bad record, marked as completed.
@Configuration
@Import(DataSourceConfig.class)
@EnableWebMvc
@ComponentScan(basePackages = "org.nova.batch")
@EnableBatchProcessing
public class BatchIssueConfiguration {

    private static final Logger LOG = LoggerFactory.getLogger(BatchIssueConfiguration.class);

    @Autowired
    private JobBuilderFactory jobBuilderFactory;

    @Autowired
    private StepBuilderFactory stepBuilderFactory;

    @Bean(name = "jobRepository")
    public JobRepository jobRepository(DataSource dataSource, PlatformTransactionManager transactionManager) throws Exception {
        JobRepositoryFactoryBean factory = new JobRepositoryFactoryBean();
        factory.setDatabaseType("derby");
        factory.setDataSource(dataSource);
        factory.setTransactionManager(transactionManager);
        return factory.getObject();
    }

    @Bean
    public Step stepSGR() throws IOException {
        return stepBuilderFactory.get("ETL_STEP").<SigmodRecord.Issue, SigmodRecord.Issue>chunk(1)
                //.processor(itemProcessor())
                .writer(itemWriter())
                .reader(multiReader())
                .faultTolerant()
                .skipLimit(Integer.MAX_VALUE)
                .skipPolicy(new FileVerificationSkipper())
                .skip(Throwable.class)
                .build();
    }

    @Bean
    public SkipPolicy fileVerificationSkipper() {
        return new FileVerificationSkipper();
    }

    @Bean
    @JobScope
    public MultiResourceItemReader<SigmodRecord.Issue> multiReader() throws IOException {
        MultiResourceItemReader<SigmodRecord.Issue> mrir = new MultiResourceItemReader<SigmodRecord.Issue>();
        //FileSystemResource[] files = new FileSystemResource[{}];
        ResourcePatternResolver rpr = new PathMatchingResourcePatternResolver();
        Resource[] resources = rpr.getResources("file:c:/temp/Sigm*.xml");
        mrir.setResources(resources);
        mrir.setDelegate(xmlItemReader());
        return mrir;
    }
}
public class FileVerificationSkipper implements SkipPolicy {

    private static final Logger LOG = LoggerFactory.getLogger(FileVerificationSkipper.class);

    @Override
    public boolean shouldSkip(Throwable t, int skipCount) throws SkipLimitExceededException {
        LOG.error("There is an error {}", t);
        return true;
    }
}
The file contains input that includes "&", which causes the read error, e.g.:
<title>Notes of DDTS & n Apparatus for Experimental Research</title>
which throws the following error:
org.springframework.dao.DataAccessResourceFailureException: Error reading XML stream; nested exception is javax.xml.stream.XMLStreamException: ParseError at [row,col]:[127,25]
Message: The entity name must immediately follow the '&' in the entity reference.
Is there anything I'm doing wrong in my configuration that prevents the rest of the records from being processed?
To skip certain types of exceptions, you can either register a skip policy where you write custom logic for skipping an exception, like the code below.
@Bean
public Step stepSGR() throws IOException {
    return stepBuilderFactory.get("ETL_STEP").<SigmodRecord.Issue, SigmodRecord.Issue>chunk(1)
            //.processor(itemProcessor())
            .writer(itemWriter())
            .reader(multiReader())
            .faultTolerant()
            .skipPolicy(new FileVerificationSkipper())
            .build();
}

public class FileVerificationSkipper implements SkipPolicy {

    private static final Logger LOG = LoggerFactory.getLogger(FileVerificationSkipper.class);

    @Override
    public boolean shouldSkip(Throwable t, int skipCount) throws SkipLimitExceededException {
        LOG.error("There is an error {}", t);
        // skip only the exception types you want to tolerate
        return t instanceof DataAccessResourceFailureException;
    }
}
Or you can simply set it up like below.
@Bean
public Step stepSGR() throws IOException {
    return stepBuilderFactory.get("ETL_STEP").<SigmodRecord.Issue, SigmodRecord.Issue>chunk(1)
            //.processor(itemProcessor())
            .writer(itemWriter())
            .reader(multiReader())
            .faultTolerant()
            .skipLimit(Integer.MAX_VALUE)
            .skip(DataAccessResourceFailureException.class)
            .build();
}
This issue falls under malformed XML, and it seems there is no way to recover from it except by fixing the XML itself. Spring's StaxEventItemReader uses an XMLEventReader for its low-level parsing of the XML, so I tried reading the XML file with an XMLEventReader to try to skip the bad block; however, XMLEventReader.nextEvent() kept throwing an exception at the bad block. I tried to handle that in a try/catch block in order to skip to the next event, but the reader would not move past it. So for now the only way to solve the issue is to fix the XML itself before processing it.
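If fixing the files by hand is not practical, one possible pre-processing sketch is to escape any stray & that is not already part of an entity reference before the reader sees the file (the paths here are placeholders, not from the original post):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class XmlAmpersandFixer {

    public static void main(String[] args) throws Exception {
        Path in = Paths.get("c:/temp/SigmodRecord.xml");        // placeholder input path
        Path out = Paths.get("c:/temp/SigmodRecord-fixed.xml"); // placeholder output path

        String xml = new String(Files.readAllBytes(in), StandardCharsets.UTF_8);
        // replace '&' unless it already starts a known entity or character reference
        String fixed = xml.replaceAll("&(?!(amp|lt|gt|apos|quot|#\\d+|#x[0-9a-fA-F]+);)", "&amp;");
        Files.write(out, fixed.getBytes(StandardCharsets.UTF_8));
    }
}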
I need a Spring Batch ItemReader to consume Kafka messages, whose results are to be processed and written further downstream.
Here's an item reader I have implemented:
public abstract class KafkaItemReader<T> implements ItemReader<List<T>> {

    public abstract KafkaConsumer<String, T> getKafkaConsumer();

    public abstract String getTopic();

    public abstract long getPollingTime();

    @Override
    public List<T> read() throws Exception, UnexpectedInputException, ParseException, NonTransientResourceException {
        Iterator<ConsumerRecord<String, T>> iterator = getKafkaConsumer()
                .poll(Duration.ofMillis(getPollingTime()))
                .records(getTopic())
                .iterator();
        List<T> records = new ArrayList<>();
        while (iterator.hasNext()) {
            records.add(iterator.next().value());
        }
        return records;
    }
}
These are the beans for the Spring Batch job and step:
@Bean
public ItemWriter<List<DbEntity>> databaseWriter(DataSource dataSource) {
    // some item writer that needs to be implemented
    return null;
}

@Bean
public Step kafkaToDatabaseStep(KafkaItemReader kafkaItemReader, // implementation of KafkaItemReader
                                StepBuilderFactory stepBuilderFactory,
                                DataSource dataSource) {
    return stepBuilderFactory
            .get("kafkaToDatabaseStep")
            .allowStartIfComplete(true)
            .<List<KafkaRecord>, List<DbEntity>>chunk(100)
            .reader(kafkaItemReader)
            .processor(itemProcessor()) // List<KafkaRecord> to List<DbEntity> converter
            .writer(databaseWriter(dataSource))
            .build();
}

@Bean
public Job kafkaToDatabaseJob(@Qualifier("kafkaToDatabaseStep") Step step) {
    return jobBuilderFactory.get("kafkaToDatabaseJob")
            .incrementer(new RunIdIncrementer())
            .flow(step)
            .end()
            .build();
}
Here I do not know:
How to commit the offset of the read messages in the writer, as I want to commit only after the record has been completely processed.
How to use JdbcBatchItemWriter as the ItemWriter in my scenario.
The upcoming Spring Batch v4.2 GA will provide support for reading data from and writing data to Apache Kafka topics. You can already try this out with the 4.2.0.M2 release.
You can also take a look at the Spring Tips installment about it by Josh Long.
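For a rough idea of what that support looks like, here is a minimal sketch of a reader built with the KafkaItemReaderBuilder shipped with Spring Batch 4.2; the broker address, group id, and topic name are placeholders, and the value type is simplified to String:

@Bean
public org.springframework.batch.item.kafka.KafkaItemReader<String, String> kafkaItemReader() {
    Properties consumerProperties = new Properties();
    consumerProperties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
    consumerProperties.put(ConsumerConfig.GROUP_ID_CONFIG, "batch-consumer-group");    // placeholder group id
    consumerProperties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    consumerProperties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

    return new KafkaItemReaderBuilder<String, String>()
            .name("kafkaItemReader")
            .consumerProperties(consumerProperties)
            .topic("my-topic")                  // placeholder topic
            .partitions(0)                      // partitions to read from
            .pollTimeout(Duration.ofSeconds(10))
            .saveState(true)                    // offsets are stored in the step execution context
            .build();
}

This reader hands out one record per read() call rather than a List, which fits the chunk model, and with saveState(true) it stores the offsets it has reached in the execution context so a restart resumes after the last successfully written chunk.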