I have a job that reads the sub-directories of a given folder. The path is dynamic; we get it from a controller. Currently I use three tasklets: one to read the sub-directories, another to process them into objects to save to the DB, and a last one to write the processed objects to the database.
The folders can have any number of sub-folders. Currently I use this code:
Path start = Paths.get("x:\\data\\");
Stream<Path> stream = Files.walk(start, 1);
List<String> collect = stream
.map(String::valueOf)
.sorted()
.collect(Collectors.toList());
This reads all the sub-folders at once.
I followed this https://www.baeldung.com/spring-batch-tasklet-chunk example of a tasklet implementation for the purpose. Is this the right approach? I also need to run the job asynchronously with multi-threading.
As there can be a huge number of sub-folders, there can be a huge number of rows/list entries to process and write to the database.
Please suggest an appropriate approach.
I am learning Spring Batch, have done a few examples on file read/process/write too, and used the chunk approach for those.
But my job is to read the sub-directories of a folder, so I cannot decide which approach to follow.
I have a similar scenario: I need to read all the files from a folder, process them, and write them to the DB (Doc).
@Configuration
@EnableBatchProcessing
public class BatchConfig {

    @Bean
    public Job job(JobBuilderFactory jobBuilderFactory, Step mainStep) {
        return jobBuilderFactory.get("MainJob")
                .incrementer(new RunIdIncrementer())
                .flow(mainStep)
                .end()
                .build();
    }

    @Bean
    public Step mainStep(StepBuilderFactory stepBuilderFactory,
                         JdbcBatchItemWriter<Transaction> writer,
                         ItemReader<String> reader,
                         TransactionItemProcessor processor) {
        return stepBuilderFactory.get("Main")
                .<String, Transaction>chunk(2)
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .taskExecutor(jobTaskExecutor())
                .listener(new ItemReaderListener())
                .build();
    }

    @Bean
    public TaskExecutor jobTaskExecutor() {
        ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
        taskExecutor.setCorePoolSize(2);
        taskExecutor.setMaxPoolSize(10);
        taskExecutor.afterPropertiesSet();
        return taskExecutor;
    }

    @Bean
    @StepScope
    public ItemReader<String> reader(@Value("#{stepExecution}") StepExecution stepExecution) throws IOException {
        Path start = Paths.get("D:\\test");
        List<String> inputFile = Files.walk(start, 1)
                .map(String::valueOf)
                .sorted()
                .collect(Collectors.toList());
        return new IteratorItemReader<>(inputFile);
    }

    @Bean
    @StepScope
    public TransactionItemProcessor processor(@Value("#{stepExecution}") StepExecution stepExecution) {
        return new TransactionItemProcessor();
    }

    @Bean
    @StepScope
    public JdbcBatchItemWriter<Transaction> writer(DataSource dataSource) {
        return new JdbcBatchItemWriterBuilder<Transaction>()
                .itemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider<>())
                .sql("INSERT INTO transaction (id, date, type) VALUES (:id, :date, :type)")
                .dataSource(dataSource)
                .build();
    }
}
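The question also asks about launching the job asynchronously from the controller. One common option, shown here as a minimal sketch (not part of the configuration above, and assuming the JobRepository auto-configured by @EnableBatchProcessing), is a JobLauncher backed by an asynchronous TaskExecutor:

@Bean
public JobLauncher asyncJobLauncher(JobRepository jobRepository) throws Exception {
    // SimpleJobLauncher runs the job on the calling thread by default;
    // an async executor makes jobLauncher.run(...) return immediately with the JobExecution.
    SimpleJobLauncher jobLauncher = new SimpleJobLauncher();
    jobLauncher.setJobRepository(jobRepository);
    jobLauncher.setTaskExecutor(new SimpleAsyncTaskExecutor());
    jobLauncher.afterPropertiesSet();
    return jobLauncher;
}

The controller then calls asyncJobLauncher.run(job, jobParameters) and returns right away, while the multi-threaded step above spreads the folder list across its own thread pool.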
I have a simple job with only one step, but somehow the batch loops from the reader to the processor and then back to the reader again. I can't understand why.
This is the structure:
The reader runs two selects on the same database. The first select searches the first table for records in a certain state, and the second select matches those results, fetches records from the second table, and sends them to the processor, which calls an API for every record.
I need the batch to stop at this point, i.e. after the processor, but I am having problems with this.
Example of my batch:
@Configuration
@EnableBatchProcessing
@EnableScheduling
public class LoadIdemOperationJob {

    @Autowired
    public JobBuilderFactory jobBuilderFactory;

    @Autowired
    public StepBuilderFactory stepBuilderFactory;

    @Autowired
    public JobLauncher jobLauncher;

    @Autowired
    public JobRegistry jobRegistry;

    @Scheduled(cron = "* */3 * * * *")
    public void perform() throws Exception {
        JobParameters jobParameters = new JobParametersBuilder()
                .addString("JobID", String.valueOf(System.currentTimeMillis()))
                .toJobParameters();
        jobLauncher.run(jobRegistry.getJob("firstJob"), jobParameters);
    }

    @Bean
    public Job firstJob(Step firstStep) {
        return jobBuilderFactory.get("firstJob")
                .start(firstStep)
                .build();
    }

    @Bean
    public Step firstStep(MyReader reader, MyProcessor processor) {
        return stepBuilderFactory.get("firstStep")
                .<List<String>, List<String>>chunk(1)
                .reader(reader)
                .processor(processor)
                .writer(new NoOpItemWriter())
                .build();
    }

    @Bean
    @StepScope
    public MyReader reader(@Value("${hours}") String hours) {
        return new MyReader(hours);
    }

    @Bean
    public MyProcessor processor() {
        return new MyProcessor();
    }

    public static class NoOpItemWriter implements ItemWriter<Object> {
        @Override
        public void write(@NonNull List<?> items) {
        }
    }

    @Bean
    public JobRegistryBeanPostProcessor jobRegistryBeanPostProcessor() {
        JobRegistryBeanPostProcessor postProcessor = new JobRegistryBeanPostProcessor();
        postProcessor.setJobRegistry(jobRegistry);
        return postProcessor;
    }

    @Bean
    public RequestContextListener requestContextListener() {
        return new RequestContextListener();
    }
}
Example of Reader:
public class MyReader implements ItemReader<List<String>> {

    public String hours;

    private List<String> results;

    @Autowired
    private JdbcTemplate jdbcTemplate;

    public MyReader(String hours) {
        this.hours = hours;
    }

    @Override
    public List<String> read() throws Exception {
        results = this.jdbcTemplate.queryForList(/* first query */, String.class);
        if (results.isEmpty()) {
            return null;
        }
        List<String> results = this.jdbcTemplate.queryForList(/* second query */, String.class);
        if (results.isEmpty()) {
            return null;
        }
        return results;
    }
}
And Processor:
public class MyProcessor implements ItemProcessor<List<String>, List<String>> {

    @Override
    public List<String> process(@NonNull List<String> results) throws Exception {
        results.forEach(result -> /* calling service */);
        return null;
    }
}
Thanks for your help!
What you are seeing is the implementation of the chunk-oriented processing model of Spring Batch, where items are read and processed in sequence one by one, and written in chunks.
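In simplified form, each chunk is handled roughly like this (an illustrative sketch of the model, not the actual Spring Batch source):

// Simplified illustration of a chunk-oriented step.
static <I, O> void processChunks(ItemReader<I> reader, ItemProcessor<I, O> processor,
                                 ItemWriter<O> writer, int chunkSize) throws Exception {
    List<O> chunk = new ArrayList<>();
    I item;
    while ((item = reader.read()) != null) {   // items are read one by one...
        chunk.add(processor.process(item));    // ...and processed one by one...
        if (chunk.size() == chunkSize) {
            writer.write(chunk);               // ...but written a whole chunk at a time
            chunk.clear();
        }
    }
    if (!chunk.isEmpty()) {
        writer.write(chunk);                   // write the last, possibly partial, chunk
    }
}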
That said, the design and configuration of your chunk-oriented step is not ideal: the reader returns a List of Strings (so an item in your case is the List itself, not an element of the list), the processor loops over the elements of each List (which it is not intended to do), and finally there is no item writer (a sign that either you don't need a chunk-oriented step, or the step is not well designed).
I recommend modifying your step design as follows:
The reader should return a single item, not a List: for example, use the iterator of the results and have the reader return iterator.next().
Remove the processor and move its code into the item writer; the item processor is optional in a chunk-oriented step.
Create an item writer with the code of the item processor. Posting results to a REST endpoint is in fact a kind of write operation, so an item writer is definitely better suited than an item processor in this case.
With that design, you should see your chunk-oriented step reading and writing all items from your list without the impression that the job is "looping". This is actually the implementation of the pattern described above.
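For illustration, a minimal sketch of that design (the bean names, the MyApiClient type, and the query placeholders are hypothetical, not code from the post): the reader hands out one String at a time via an IteratorItemReader, and the former processor logic becomes the item writer.

@Bean
@StepScope
public ItemReader<String> resultReader(JdbcTemplate jdbcTemplate) {
    // Run both queries up front; the reader then emits the results one by one
    // and returns null when the iterator is exhausted, which ends the step.
    List<String> first = jdbcTemplate.queryForList(/* first query */ "...", String.class);
    List<String> results = first.isEmpty()
            ? Collections.<String>emptyList()
            : jdbcTemplate.queryForList(/* second query */ "...", String.class);
    return new IteratorItemReader<>(results);
}

@Bean
public ItemWriter<String> apiCallWriter(MyApiClient apiClient) {
    // The API call is now the write operation: one call per item in the chunk.
    return items -> items.forEach(apiClient::call);
}

The step is then declared as .<String, String>chunk(n).reader(resultReader).writer(apiCallWriter) with no processor at all.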
First of all, thank you for visiting my post.
I have a Spring Batch Project that is using tasklet and chunk.
The chunk step, aka step1, is where I process data and generate a new CSV file in the S3 bucket.
The tasklet step, aka step2, is where I read the CSV file from the S3 bucket that was generated in step1 and publish an SNS topic.
Right now, I have a problem with the SNSTopicSender class that implements Tasklet.
First of all, here are the config class and the SNSTopicSender class.
@Bean
Step step1() {
    return this.stepBuilderFactory
            .get("step1")
            .<BadStudent, BadStudent>chunk(100)
            .reader(new IteratorItemReader<Student>(this.StudentLoader.Students.iterator()) as ItemReader<? extends BadStudent>)
            .processor(this.studentProcessor as ItemProcessor<? super BadStudent, ? extends BadStudent>)
            .writer(this.csvWriter())
            .build()
}

@Bean
Step step2() {
    return this.stepBuilderFactory
            .get("step2")
            .tasklet(new PublishSnsTopic())
            .build()
}

@Bean
Job job() {
    return this.jobBuilderFactory
            .get("scoring-students-batch")
            .incrementer(new RunIdIncrementer())
            .start(this.step1())
            .next(this.step2())
            .build()
}

@Configuration
@Service
class SNSTopicSender implements Tasklet {

    @Autowired
    ResourceLoader resourceLoader

    List<BadStudent> badStudents

    @Autowired
    FileProperties fileProperties

    @PostConstruct
    void setup() {
        String badStudentCSVFileName = "s3://students/failedStudents.csv"
        Reader badStudentReader = new InputStreamReader(
                this.resourceLoader.getResource(badStudentCSVFileName).inputStream
        )
        // create message body and call the publishTopic function
    }

    void publishTopic(SnsClient snsClient, String message, String arn) {
        // sending a topic
    }

    @Override
    RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        return RepeatStatus.FINISHED
    }
}
I used @PostConstruct because without it, resourceLoader and fileProperties would be null, since the bean injection would not have happened yet, as you know.
To work around that, I used @PostConstruct.
However, I recently realized that the SNSTopicSender class is not reading the CSV file just created; it is reading the CSV file that was already there before this batch job ran, because @PostConstruct gets fired before step1 (where the target CSV file is created) completes.
But if I remove @PostConstruct, then resourceLoader and fileProperties will be null, meaning this batch job would not know where the CSV file is stored or how to read it.
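For illustration only (a sketch, not from the original post): if the tasklet wired into step2 is the Spring-managed SNSTopicSender bean (injected into the step definition rather than created with new), field injection still works, and moving the S3 read out of @PostConstruct and into execute() delays it until step2 actually runs, i.e. after step1 has written the file:

@Override
public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
    // Runs when step2 executes, i.e. after step1 has generated the CSV in S3.
    String badStudentCSVFileName = "s3://students/failedStudents.csv";
    Reader badStudentReader = new InputStreamReader(
            this.resourceLoader.getResource(badStudentCSVFileName).getInputStream());
    // build the message body and call publishTopic(...) here
    return RepeatStatus.FINISHED;
}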
Can anyone help me with this please?
I need a Spring Batch ItemReader to consume Kafka messages whose results will be processed and written further down the line.
Here's an item reader I have implemented:
public abstract class KafkaItemReader<T> implements ItemReader<List<T>> {

    public abstract KafkaConsumer<String, T> getKafkaConsumer();

    public abstract String getTopic();

    public abstract long getPollingTime();

    @Override
    public List<T> read() throws Exception, UnexpectedInputException, ParseException, NonTransientResourceException {
        Iterator<ConsumerRecord<String, T>> iterator = getKafkaConsumer()
                .poll(Duration.ofMillis(getPollingTime()))
                .records(getTopic())
                .iterator();

        List<T> records = new ArrayList<>();
        while (iterator.hasNext()) {
            records.add(iterator.next().value());
        }
        return records;
    }
}
These are the beans for the Spring Batch job and step:
@Bean
public ItemWriter<List<DbEntity>> databaseWriter(DataSource dataSource) {
    // some item writer that needs to be implemented
    return null;
}

@Bean
public Step kafkaToDatabaseStep(KafkaItemReader kafkaItemReader, // implementation of KafkaItemReader
                                StepBuilderFactory stepBuilderFactory,
                                DataSource dataSource) {
    return stepBuilderFactory
            .get("kafkaToDatabaseStep")
            .allowStartIfComplete(true)
            .<List<KafkaRecord>, List<DbEntity>>chunk(100)
            .reader(kafkaItemReader)
            .processor(itemProcessor()) // List<KafkaRecord> to List<DbEntity> converter
            .writer(databaseWriter(dataSource))
            .build();
}

@Bean
public Job kafkaToDatabaseJob(@Qualifier("kafkaToDatabaseStep") Step step) {
    return jobBuilderFactory.get("kafkaToDatabaseJob")
            .incrementer(new RunIdIncrementer())
            .flow(step)
            .end()
            .build();
}
Here I do not know:
How to commit the offset of the read messages in the writer, as I want to commit only after the record has been completely processed.
How to use JdbcBatchItemWriter as the ItemWriter in my scenario.
The upcoming Spring Batch v4.2 GA will provide support for reading/writing data to Apache Kafka topics. You can already try this out with the 4.2.0.M2 release.
You can also take a look at the Spring Tips installment about it by Josh Long.
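For reference, a minimal sketch of what the reader could look like with the 4.2 KafkaItemReaderBuilder (the broker address, group id, topic name, and deserializers below are placeholders; check the builder API against the release you actually use):

@Bean
public KafkaItemReader<String, KafkaRecord> kafkaItemReader() {
    Properties consumerProperties = new Properties();
    consumerProperties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    consumerProperties.put(ConsumerConfig.GROUP_ID_CONFIG, "batch-consumer");
    consumerProperties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    consumerProperties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, JsonDeserializer.class.getName());

    return new KafkaItemReaderBuilder<String, KafkaRecord>()
            .name("kafkaItemReader")
            .consumerProperties(consumerProperties)
            .topic("my-topic")                   // placeholder topic name
            .partitions(0)                       // partitions this reader should consume
            .pollTimeout(Duration.ofSeconds(10))
            .saveState(true)                     // keep offsets in the execution context for restartability
            .build();
}

Because this reader emits individual records rather than lists, the step could be typed .<KafkaRecord, DbEntity>chunk(100) and a plain JdbcBatchItemWriter<DbEntity> used as the writer, which also relates to the second question.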
I am using Spring Boot 2.0.5.RELEASE and running a batch process using this:
# prevent auto-start of batch jobs
spring:
  batch:
    job:
      enabled: false
and triggering it manually using a controller endpoint (the input contains the parameters collected from the user via the controller):
jobLauncher.run(job, new JobParametersBuilder()
        .addDate("date", new Date())
        .addJobParameters(new JobParameters(input))
        .toJobParameters());
Here is my batch configuration:
@Bean
public MongoItemReader<Document> reader() {
    MongoItemReader<Document> reader = new MongoItemReader<>();
    reader.setTemplate(mongoTemplate);
    reader.setCollection(XML_PERSIST_COLLECTION);
    reader.setQuery("{}");
    Map<String, Sort.Direction> sorts = new HashMap<>(1);
    sorts.put("status", Sort.Direction.ASC);
    reader.setSort(sorts);
    reader.setTargetType(Document.class);
    return reader;
}

@Bean
@StepScope
public MyItemProcessor processor() {
    return new MyItemProcessor();
}

@Bean
public MongoItemWriter<OutputDto> writer() {
    MongoItemWriter<OutputDto> writer = new MongoItemWriter<>();
    writer.setTemplate(mongoTemplate);
    writer.setCollection(RESPONSE_COLLECTION);
    return writer;
}

@Bean
public Step step() {
    return stepBuilderFactory.get("step")
            .<Document, OutputDto>chunk(1)
            .reader(reader())
            .processor(processor())
            .writer(writer())
            .allowStartIfComplete(true)
            .build();
}

@Bean
public Job job(Step step) {
    return jobBuilderFactory.get("job")
            .incrementer(new RunIdIncrementer())
            .flow(step)
            .end()
            .build();
}
and my processor:
public class MyItemProcessor implements ItemProcessor<Document, OutputDto> {

    @Value("#{jobParameters['username']}")
    private String username;

    @Value("#{jobParameters['password']}")
    private String password;

    @Override
    public OutputDto process(final Document document) throws Exception {
        // implementation code
    }
}
I am using @StepScope on the processor to extract the job parameters that are passed from my controller.
Issue:
Everything is fine except that the batch job runs only once after the app starts and will not run again (the job launches, but I set breakpoints in the processor and execution never reaches them). I am already adding a timestamp job parameter so that the batch job can be run again, yet the processor does not run more than once (when it should). Any ideas?
The reader() and writer() had singleton scope while the processor() had @StepScope, so it looks like that's why the writer() was not getting invoked.
I added @StepScope to the reader and writer and now everything is working fine, though it didn't strike me as intuitive; it should have worked without that in 2.0.5.RELEASE.
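For reference, the change amounts to making the stateful reader (and writer) step-scoped, so a fresh instance is created for every step execution instead of reusing an already-exhausted singleton. A minimal sketch of the reader bean from the question with the annotation added:

@Bean
@StepScope
public MongoItemReader<Document> reader() {
    // A new reader instance is built each time the step runs,
    // so its internal read position starts from the beginning again.
    MongoItemReader<Document> reader = new MongoItemReader<>();
    reader.setTemplate(mongoTemplate);
    reader.setCollection(XML_PERSIST_COLLECTION);
    reader.setQuery("{}");
    Map<String, Sort.Direction> sorts = new HashMap<>(1);
    sorts.put("status", Sort.Direction.ASC);
    reader.setSort(sorts);
    reader.setTargetType(Document.class);
    return reader;
}

The same @StepScope annotation goes on the writer bean.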
Using Spring Batch 3.0.4.RELEASE.
I configured a job to use a partitioned step. The slave step uses chunk size 1. There are six threads in the task executor. I run this test with various grid sizes, from six to hundreds. The grid size is the number of slave StepExecutions I expect, i.e. the number of ExecutionContexts created by my partitioner.
The result is always this:
The six threads pick up six different step executions and execute them successfully. Then the same six step executions run again and again in the same thread!
I noticed that there is a loop in RepeatTemplate.executeInternal(...) that never ends. It keeps executing the same StepExecution, just incrementing the version.
Here's the Java configuration code:
@Bean
@StepScope
public RapRequestItemReader rapReader(
        @Value("#{stepExecutionContext['" + RapJobConfig.LIST_OF_IDS_STEP_EXECUTION_CONTEXT_VAR + "']}") String listOfIds,
        final @Value("#{stepExecutionContext['" + RapJobConfig.TIME_STEP_EXECUTION_CONTEXT_VAR + "']}") String timeString) {
    final List<Asset> farms = Arrays.asList(listOfIds.split(",")).stream()
            .map(intString -> assetDao.getById(Integer.valueOf(intString)))
            .collect(Collectors.toList());
    return new RapRequestItemReader(timeString, farms);
}

@Bean
public ItemProcessor<RapRequest, PullSuccess> rapProcessor() {
    return rapRequest -> {
        return rapPull.pull(rapRequest.timestamp, rapRequest.farms);
    };
}

@Bean
public TaskletStep rapStep1(StepBuilderFactory stepBuilderFactory, RapRequestItemReader rapReader) {
    return stepBuilderFactory.get(RAP_STEP_NAME)
            .<RapRequest, PullSuccess>chunk(RAP_STEP_CHUNK_SIZE)
            .reader(rapReader)
            .processor(rapProcessor())
            .writer(updateCoverageWriter)
            .build();
}

private RapFilePartitioner createRapFilePartitioner(RapParameter rapParameter) {
    RapFilePartitioner partitioner = new RapFilePartitioner(rapParameter, rapPull.getIncrementHours());
    return partitioner;
}

@Bean
public ThreadPoolTaskExecutor pullExecutor() {
    ThreadPoolTaskExecutor pullExecutor = new ThreadPoolTaskExecutor();
    pullExecutor.setCorePoolSize(weatherConfig.getNumberOfThreadsPerModelType());
    pullExecutor.setMaxPoolSize(weatherConfig.getNumberOfThreadsPerModelType());
    pullExecutor.setAllowCoreThreadTimeOut(true);
    return pullExecutor;
}

@Bean
@JobScope
public Step rapPartitionByTimestampStep(StepBuilderFactory stepBuilderFactory,
        @Value("#{jobParameters['config']}") String config, TaskletStep rapStep1) {
    RapParameter rapParameter = GsonHelper.fromJson(config, RapParameter.class);
    int gridSize = calculateGridSize(rapParameter);
    return stepBuilderFactory.get("rapPartitionByTimestampStep")
            .partitioner(rapStep1)
            .partitioner(RAP_STEP_NAME, createRapFilePartitioner(rapParameter))
            .taskExecutor(pullExecutor())
            .gridSize(gridSize)
            .build();
}

@Bean
public Job rapJob(JobBuilderFactory jobBuilderFactory, Step rapPartitionByTimestampStep) {
    return jobBuilderFactory.get(JOB_NAME)
            .start(rapPartitionByTimestampStep)
            .build();
}
Though it's hard to tell this from the question, the problem was in the reader. The ItemReader was never returning null.
In the design, a StepExecution was supposed to process only one item. However, after processing that item, the ItemReader was returning that same item again instead of returning null.
I fixed it by having the ItemReader return null the second time read is called.
A better design might be to use a TaskletStep instead of a ChunkStep.
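For illustration, a minimal one-shot reader along the lines of the fix described above (a sketch with a hypothetical class name, not the original code): it returns its single item on the first call and null afterwards, which signals the end of the input and lets the chunk-oriented step finish.

// Hypothetical one-shot reader: hands out its single item once, then signals end of data.
public class SingleItemReader<T> implements ItemReader<T> {

    private final T item;
    private boolean consumed = false;

    public SingleItemReader(T item) {
        this.item = item;
    }

    @Override
    public T read() {
        if (consumed) {
            return null;      // second and later calls: no more data, the step completes
        }
        consumed = true;
        return item;          // first call: return the single item
    }
}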