Implementation of a step-scoped in-memory destination in spring batch - java

I have a requirement, when using Spring Batch, to create a large report in binary format that is held in a database. None of the working data can be written directly to files, or to working tables outside the JobExecutionContext.
I'm aware that normally you would just write to the job execution context, but I'm a little confused as to how I would go about it with such a large report (potentially several hundred megabytes.)
At the moment, my Writer implementation has a dependency on an aggregator class, which is injected as a bean; a Tasklet then has the same aggregator injected into it and writes the finished report to the database.
The problem is that I cannot scope my aggregator to the step context and as such if two jobs are running at the same time they will be writing to the same aggregator.
Here is my current implementation
Domain class
public class DataChunk {
    private int pageNumber;
    private byte[] data;
    // getters and setters omitted
}
Writer
public class FooWriter implements ItemWriter<DataChunk> {
    private FooAggregator fooAggregator;

    @Override
    public void write(List<? extends DataChunk> dataChunks) throws Exception {
        dataChunks.forEach(chunk -> fooAggregator.addChunk(chunk.getPageNumber(), chunk.getData()));
    }
}
Aggregator
public class FooAggregator {
    // Key-sorted Map implementation (e.g. TreeMap) so pages are assembled in order
    private Map<Integer, byte[]> pagedData = new TreeMap<>();

    public void addChunk(int pageNumber, byte[] data) {
        pagedData.put(pageNumber, data);
    }

    public byte[] aggregate() throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        for (byte[] data : pagedData.values()) {
            baos.write(data);
        }
        return baos.toByteArray();
    }
}
Report Writing Tasklet
public class ReportWritingTasklet implements Tasklet {
    private ReportRepository reportRepository;
    private FooAggregator fooAggregator;

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext context) throws Exception {
        byte[] data = fooAggregator.aggregate();
        reportRepository.getOne(reportId).setDataBytes(data);
        return RepeatStatus.FINISHED;
    }
}
Context
<?xml version="1.0" encoding="UTF-8"?>
<beans> <!-- namespace declarations omitted -->
    <bean id="fooWriter" class="FooWriter" scope="step"
          p:fooAggregator-ref="fooAggregator"/>
    <bean id="fooAggregator" class="FooAggregator"/>
    <bean id="reportWritingTasklet" class="ReportWritingTasklet" scope="step"
          p:fooAggregator-ref="fooAggregator"/>

    <batch:job id="fooJob">
        <batch:step id="generateReport" next="assembleReport">
            <batch:tasklet>
                <batch:chunk reader="fooReader" processor="fooProcessor" writer="fooWriter"/>
            </batch:tasklet>
        </batch:step>
        <batch:step id="assembleReport">
            <batch:tasklet ref="reportWritingTasklet"/>
        </batch:step>
    </batch:job>
</beans>
If I attempt to make the FooAggregator step-scoped I get the following exception as a root-cause
Caused by: java.lang.IllegalStateException: Cannot convert value of type [com.sun.proxy.$Proxy98 implementing org.springframework.aop.scope.ScopedObject,java.io.Serializable,org.springframework.aop.framework.AopInfrastructureBean,org.springframework.aop.SpringProxy,org.springframework.aop.framework.Advised] to required type [FooAggregator] for property 'fooAggregator': no matching editors or conversion strategy found
This is because the step-scoped bean is exposed through a JDK proxy that only implements interfaces (ScopedObject, Advised, etc.), so it cannot be injected into a property typed as the concrete FooAggregator class.
How can I use the Execution context as a sink for my data chunks, bearing in mind there will be a lot of them and they will be very large?

I've managed to solve this. It's not very Spring batch-y, but it meets my requirements.
Essentially, there is too much data to push into and out of the context. The solution has been to maintain the state in the writer itself, and to make it a StepExecutionListener that will save at the end of the Step, via a TransactionCallback.
The Updated Writer
public class FooWriter extends StepExecutionListenerSupport implements ItemWriter<DataChunk> {

    private String reportId;
    private Map<Integer, byte[]> byteArrayMap = new ConcurrentSkipListMap<>();
    private TransactionTemplate transactionTemplate;
    private ReportRepository reportRepository;

    @Override
    public synchronized void write(List<? extends DataChunk> dataChunks) throws Exception {
        dataChunks.forEach(chunk -> byteArrayMap.put(chunk.getPageNumber(), chunk.getData()));
    }

    @Override
    public void beforeStep(StepExecution stepExecution) {
        // No-op
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        StringBuilder sb = new StringBuilder();
        for (byte[] chunkBytes : byteArrayMap.values()) {
            sb.append(new String(Base64.decode(chunkBytes)));
        }
        String encodedReportData = new String(Base64.encode(sb.toString().getBytes()));
        TransactionCallback<Report> transactionCallback = transactionStatus -> {
            Report report = reportRepository.getOne(this.reportId);
            report.setReportData(encodedReportData);
            reportRepository.save(report);
            return report;
        };
        // TransactionTemplate throws its own declared TransactionException and rethrows any
        // RuntimeExceptions and Errors it encounters. Any problem writing the data kills the job,
        // so it's OK to catch Throwable here rather than handling each case separately.
        try {
            transactionTemplate.execute(transactionCallback);
        } catch (Throwable t) {
            LOGGER.error("Error saving report data ID:[{}]", reportId, t);
            return ExitStatus.FAILED.addExitDescription(t);
        }
        return ExitStatus.COMPLETED;
    }
}
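One wiring detail worth calling out: the afterStep() callback only runs if the writer is registered as a StepExecutionListener on the step, which may need to be done explicitly when the writer is a step-scoped proxy. As a rough sketch only (assuming a Java-config style with the usual builder factories; the reader, processor and repository beans are assumed to exist elsewhere), the step could be wired like this:
@Bean
@StepScope
public FooWriter fooWriter() {
    FooWriter fooWriter = new FooWriter();
    // setters for reportId, transactionTemplate and reportRepository would be called here
    return fooWriter;
}

@Bean
public Step generateReport(ItemReader<DataChunk> fooReader,
                           ItemProcessor<DataChunk, DataChunk> fooProcessor) {
    return stepBuilderFactory.get("generateReport")
            .<DataChunk, DataChunk>chunk(100)
            .reader(fooReader)
            .processor(fooProcessor)
            .writer(fooWriter())
            .listener(fooWriter()) // explicit registration so afterStep() is guaranteed to fire
            .build();
}
With the writer step-scoped, each running job gets its own FooWriter instance, so two concurrent jobs no longer write into the same aggregated state.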

Related

Loop Spring Batch

I have a simple job with only one step, but somehow the batch loops from the reader to the processor and then back to the reader again, and I can't understand why.
This is the structure:
The reader makes two selects on the same database. The first select searches the first table for records in a certain state, and the second select matches those results, gets some records from a second table and sends them to the processor, which calls an API for every record.
I need the batch to stop running at this point, after the processor, but I am having problems with this.
Example of my batch:
@Configuration
@EnableBatchProcessing
@EnableScheduling
public class LoadIdemOperationJob {

    @Autowired
    public JobBuilderFactory jobBuilderFactory;

    @Autowired
    public StepBuilderFactory stepBuilderFactory;

    @Autowired
    public JobLauncher jobLauncher;

    @Autowired
    public JobRegistry jobRegistry;

    @Scheduled(cron = "* */3 * * * *")
    public void perform() throws Exception {
        JobParameters jobParameters = new JobParametersBuilder()
                .addString("JobID", String.valueOf(System.currentTimeMillis()))
                .toJobParameters();
        jobLauncher.run(jobRegistry.getJob("firstJob"), jobParameters);
    }

    @Bean
    public Job firstJob(Step firstStep) {
        return jobBuilderFactory.get("firstJob")
                .start(firstStep)
                .build();
    }

    @Bean
    public Step firstStep(MyReader reader,
                          MyProcessor processor) {
        return stepBuilderFactory.get("firstStep")
                .<List<String>, List<String>>chunk(1)
                .reader(reader)
                .processor(processor)
                .writer(new NoOpItemWriter())
                .build();
    }

    @Bean
    @StepScope
    public MyReader reader(@Value("${hours}") String hours) {
        return new MyReader(hours);
    }

    @Bean
    public MyProcessor processor() {
        return new MyProcessor();
    }

    public static class NoOpItemWriter implements ItemWriter<Object> {
        @Override
        public void write(@NonNull List<?> items) {
        }
    }

    @Bean
    public JobRegistryBeanPostProcessor jobRegistryBeanPostProcessor() {
        JobRegistryBeanPostProcessor postProcessor = new JobRegistryBeanPostProcessor();
        postProcessor.setJobRegistry(jobRegistry);
        return postProcessor;
    }

    @Bean
    public RequestContextListener requestContextListener() {
        return new RequestContextListener();
    }
}
Example of Reader:
public class MyReader implements ItemReader<List<String>> {

    public String hours;

    private List<String> results;

    @Autowired
    private JdbcTemplate jdbcTemplate;

    public MyReader(String hours) {
        this.hours = hours;
    }

    @Override
    public List<String> read() throws Exception {
        results = this.jdbcTemplate.queryForList(/* 1st query */, String.class);
        if (results.isEmpty()) {
            return null;
        }
        List<String> results = this.jdbcTemplate.queryForList(/* 2nd query */, String.class);
        if (results.isEmpty()) {
            return null;
        }
        return results;
    }
}
And Processor:
public class MyProcessor implements ItemProcessor<List<String>, List<String>> {
    @Override
    public List<String> process(@NonNull List<String> results) throws Exception {
        results.forEach(result -> { /* calling service */ });
        return null;
    }
}
Thanks for help!
What you are seeing is the implementation of the chunk-oriented processing model of Spring Batch, where items are read and processed in sequence one by one, and written in chunks.
That said, the design and configuration of your chunk-oriented step is not ideal: the reader returns a List of Strings (so an item in your case is the List itself, not an element from the list), the processor loops over the elements of each List (while it is not intended to do so), and finally there is no item writer (this is a sign that either you don't need a chunk-oriented step, or the step is not well designed).
I recommend modifying your step design as follows:
The reader should return a single item, not a List: for example, keep an iterator over the results and have the reader return iterator.next().
Remove the processor and move its code into the item writer; the item processor is optional in a chunk-oriented step.
Create an item writer with the code of the item processor. Posting results to a REST endpoint is in fact a kind of write operation, so an item writer is definitely better suited than an item processor in this case.
With that design, you should see your chunk-oriented step reading and writing all items from your list without the impression that the job is "looping". This is actually the implementation of the pattern described above.
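To make that concrete, here is a minimal sketch of what the reworked reader and writer could look like. This is illustrative only: the queries and the service call are placeholders, and the class names are just examples.
// Sketch of a reader that returns one String per read() call instead of the whole List.
public class MyItemReader implements ItemReader<String> {

    private final JdbcTemplate jdbcTemplate;
    private Iterator<String> iterator;

    public MyItemReader(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public String read() {
        if (iterator == null) {
            // placeholder queries; run them once and iterate over the result
            List<String> firstResults = jdbcTemplate.queryForList("/* 1st query */", String.class);
            List<String> results = firstResults.isEmpty()
                    ? Collections.<String>emptyList()
                    : jdbcTemplate.queryForList("/* 2nd query */", String.class);
            iterator = results.iterator();
        }
        return iterator.hasNext() ? iterator.next() : null; // returning null ends the step
    }
}

// Sketch of a writer that replaces the processor: calling the API is the "write" operation.
public class MyApiWriter implements ItemWriter<String> {
    @Override
    public void write(List<? extends String> items) {
        items.forEach(item -> {
            // call the API for each item here
        });
    }
}
The step definition then becomes .<String, String>chunk(10).reader(reader).writer(writer).build(), with no processor at all.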

Spring Batch how to set filepath in ExecutionContext for next step

I have a spring batch workflow where I read from a flat csv file and write it to a csv file.
This is what my ItemWriter looks like:
@Configuration
public class MyCSVFileWriter implements ItemWriter<RequestModel> {

    private final FlatFileItemWriter<RequestModel> writer;

    private ExecutionContext jobContext;

    public MyCSVFileWriter() throws Exception {
        this.writer = new FlatFileItemWriter<>();
        DelimitedLineAggregator<RequestModel> lineAggregator = new DelimitedLineAggregator<>();
        BeanWrapperFieldExtractor<RequestModel> extractor = new BeanWrapperFieldExtractor<>();
        extractor.setNames(new String[]{"id", "source", "date"});
        lineAggregator.setFieldExtractor(extractor);
        this.writer.setLineAggregator(lineAggregator);
        this.writer.setShouldDeleteIfExists(true);
        this.writer.afterPropertiesSet();
    }

    @Override
    public void write(List<? extends RequestModel> items) throws Exception {
        this.writer.open(jobContext);
        this.writer.write(items);
    }

    @BeforeStep
    public void beforeStepHandler(StepExecution stepExecution) {
        JobExecution jobExecution = stepExecution.getJobExecution();
        jobContext = jobExecution.getExecutionContext();
        this.writer.setResource(new FileSystemResource(getRequestOutputPathResource(jobContext)));
    }

    private String getRequestOutputPathResource(ExecutionContext jobContext) {
        //***
        return resourcePath;
    }
}
I use the executionContext to extract some data used to calculate my resourcePath for my writer.
My next step after writing is to upload the file written in the previous step to a remote server. For this I need to store the file path that was calculated in the ExecutionContext and make it available in the next step.
What is the best way to do this? Should I be doing this in an @AfterStep handler?
Should I be doing this in an @AfterStep handler?
Yes, this is a good option. You can store the path of the file that has been written in the job execution context and read it from there in the next step that uploads the file.
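For example (names like "outputFilePath" and UploadTasklet below are illustrative, not from the original post), the writer could record the path in an @AfterStep method and the next step could read it back:
// In MyCSVFileWriter (or a dedicated StepExecutionListener): store the path once the step is done.
@AfterStep
public ExitStatus afterStepHandler(StepExecution stepExecution) {
    stepExecution.getJobExecution().getExecutionContext()
            .putString("outputFilePath", getRequestOutputPathResource(jobContext));
    return stepExecution.getExitStatus();
}

// In the upload step, e.g. a Tasklet: read the path back from the job execution context.
public class UploadTasklet implements Tasklet {
    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        String path = chunkContext.getStepContext().getStepExecution()
                .getJobExecution().getExecutionContext().getString("outputFilePath");
        // upload the file at 'path' to the remote server here
        return RepeatStatus.FINISHED;
    }
}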

spring batch using spring boot: Read arguments from config or command line and use them in job

I am pretty new to spring technology. I am trying to build an ETL like app using spring batch with spring boot.
Able to run the basic job (read->process->write). Now, I want to read the arguments (like date, file name, type, etc) from a config file (later) or command line (can work with it now) and use them in my job.
Entry point:
// Imports
@SpringBootApplication
@EnableBatchProcessing
public class EtlSpringBatchApplication {
    public static void main(String[] args) {
        SpringApplication.run(EtlSpringBatchApplication.class, args);
    }
}
My batch configuration
// BatchConfig.java
// Imports
@Autowired
public JobBuilderFactory jobBuilderFactory;

@Autowired
public StepBuilderFactory stepBuilderFactory;

@Autowired
public MyDao myDao;

@Bean
public Job job() {
    return jobBuilderFactory
            .get("job")
            .incrementer(new RunIdIncrementer())
            .listener(new Listener(myDao))
            .flow(step1())
            .end()
            .build();
}

@Bean
public Step step1() {
    return stepBuilderFactory.get("step1").<MyModel, MyModel>chunk(1000)
            .reader(Reader.reader("my_file_20200520.txt"))
            .processor(new Processor())
            .writer(new Writer(myDao))
            .build();
}
I have the basic steps.
Reader.java has a method to read a flat file.
public static FlatFileItemReader<MyModel> reader(String path) {......}
Processor.java has a process method defined. I added a @BeforeStep method to fetch some details from the DB that are required for processing.
public class Processor implements ItemProcessor<MyModel, MyModel> {

    private static final Logger log = LoggerFactory.getLogger(Processor.class);

    private Long id = null;

    @BeforeStep
    public void getId(StepExecution stepExecution) {
        this.id = stepExecution.getJobExecution().getExecutionContext().getLong("Id");
    }

    @Override
    public MyModel process(MyModel myModel) throws Exception {
        // processing logic omitted
        return myModel;
    }
}
Writer.java implements ItemWriter and the write code.
Listener.java extends JobExecutionListenerSupport and overrides the afterJob and beforeJob methods.
Basically I tried to use the execution context here in beforeJob:
@Override
public void beforeJob(JobExecution jobExecution) {
    log.info("Getting the id..");
    this.id = myDao.getLatestId();
    log.info("id retrieved is: " + this.id);
    jobExecution.getExecutionContext().putLong("Id", this.id);
}
Now, what I am looking for is:
The reader should get the file name from the job arguments, i.e. when I run the job, I should be able to pass some arguments, one of them being the file path.
Later, some methods (like get id, etc.) require a few more variables which can also be passed as job arguments, i.e. run_date, type, etc.
In short I am looking for a way to,
Pass job arguments to my app (run_date, type, file path etc)
Use them in reader and other places (Listener, Writer)
Can someone tell me what additions I should make in my BatchConfig.java and other places to read the job parameters (from the command line or a config file, whichever is easier)?
Both Spring Batch and Spring Boot reference documentation show how to pass parameters to a job:
Running Jobs from the Command Line
Running Spring Batch jobs from the Command Line
Moreover, the Spring Batch docs explain in detail, and with code examples, how to use those parameters in batch components (like the reader, writer, etc.):
Late Binding of Job and Step Attributes
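As a quick, generic illustration of late binding (this is a sketch, not code from the post; the parameter and column names are examples): with Spring Boot's batch auto-configuration, command-line arguments of the form name=value are turned into job parameters, and a step-scoped bean can then pull them in with a jobParameters SpEL expression.
// Launched e.g. with: java -jar etl-app.jar filePath=/data/my_file_20200520.txt run_date=2020-05-20
@Bean
@StepScope
public FlatFileItemReader<MyModel> reader(@Value("#{jobParameters['filePath']}") String filePath) {
    BeanWrapperFieldSetMapper<MyModel> fieldSetMapper = new BeanWrapperFieldSetMapper<>();
    fieldSetMapper.setTargetType(MyModel.class);
    return new FlatFileItemReaderBuilder<MyModel>()
            .name("myModelReader")
            .resource(new FileSystemResource(filePath))
            .delimited()
            .names(new String[]{"field1", "field2"}) // placeholder column names
            .fieldSetMapper(fieldSetMapper)
            .build();
}
The same #{jobParameters['...']} expression works on any other @StepScope (or @JobScope) bean, so run_date, type, etc. can be injected into the writer or a listener in the same way.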
You can read the value of the job parameters set from the config file inside the reader or other classes within the Spring Batch execution context. Below is a snippet for reference.
application.yml file can have the below config,
batch.configs.filePath: c:\test
You can add the filePath read from the config to your job parameters when you start the job. Snippet of the class,
// Job and Job Launcher related autowires..
@Value("${batch.configs.filePath}")
private String filePath;
// inside a method block,
JobParameters jobParameters = new JobParametersBuilder().addLong("JobID", System.currentTimeMillis())
.addString("filePath", filePath).toJobParameters();
try {
jobLauncher.run(batchJob, jobParameters);
} catch (Exception e) {
logger.error("Exception while running a batch job {}", e.getMessage());
}
One way to access the job parameters is to implement StepExecutionListener in your reader class and make use of its overridden beforeStep and afterStep methods. Similar implementations can be added to other classes as well:
public class Reader implements ItemReader<String>, StepExecutionListener {

    private String filePath;

    @Override
    public void beforeStep(StepExecution stepExecution) {
        try {
            // "filePath" was added to the JobParameters when the job was launched
            filePath = stepExecution.getJobParameters().getString("filePath");
        } catch (Exception e) {
            logger.error("Exception while performing read {}", e);
        }
    }

    @Override
    public String read() throws Exception {
        // the filePath value read from the job parameters can be used in the read implementation
        return null; // actual read logic omitted
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        return ExitStatus.COMPLETED;
    }
}

Spring batch does not continue processing records after xml reading error

I configured a Spring Batch job to skip a bad record when there is an error reading the xml file. The SkipPolicy implementation always returns true in order to skip the bad record.
The job needs to continue processing the rest of the records; however, in my case it stops after the bad record and is marked as completed.
@Configuration
@Import(DataSourceConfig.class)
@EnableWebMvc
@ComponentScan(basePackages = "org.nova.batch")
@EnableBatchProcessing
public class BatchIssueConfiguration {

    private static final Logger LOG = LoggerFactory.getLogger(BatchIssueConfiguration.class);

    @Autowired
    private JobBuilderFactory jobBuilderFactory;

    @Autowired
    private StepBuilderFactory stepBuilderFactory;

    @Bean(name = "jobRepository")
    public JobRepository jobRepository(DataSource dataSource, PlatformTransactionManager transactionManager) throws Exception {
        JobRepositoryFactoryBean factory = new JobRepositoryFactoryBean();
        factory.setDatabaseType("derby");
        factory.setDataSource(dataSource);
        factory.setTransactionManager(transactionManager);
        return factory.getObject();
    }

    @Bean
    public Step stepSGR() throws IOException {
        return stepBuilderFactory.get("ETL_STEP").<SigmodRecord.Issue, SigmodRecord.Issue>chunk(1)
                //.processor(itemProcessor())
                .writer(itemWriter())
                .reader(multiReader())
                .faultTolerant()
                .skipLimit(Integer.MAX_VALUE)
                .skipPolicy(new FileVerificationSkipper())
                .skip(Throwable.class)
                .build();
    }

    @Bean
    public SkipPolicy fileVerificationSkipper() {
        return new FileVerificationSkipper();
    }

    @Bean
    @JobScope
    public MultiResourceItemReader<SigmodRecord.Issue> multiReader() throws IOException {
        MultiResourceItemReader<SigmodRecord.Issue> mrir = new MultiResourceItemReader<SigmodRecord.Issue>();
        //FileSystemResource [] files = new FileSystemResource [{}];
        ResourcePatternResolver rpr = new PathMatchingResourcePatternResolver();
        Resource[] resources = rpr.getResources("file:c:/temp/Sigm*.xml");
        mrir.setResources(resources);
        mrir.setDelegate(xmlItemReader());
        return mrir;
    }
}

public class FileVerificationSkipper implements SkipPolicy {

    private static final Logger LOG = LoggerFactory.getLogger(FileVerificationSkipper.class);

    @Override
    public boolean shouldSkip(Throwable t, int skipCount) throws SkipLimitExceededException {
        LOG.error("There is an error {}", t);
        return true;
    }
}
The file has inputs which includes "&" that causes the reading error i.e.
<title>Notes of DDTS & n Apparatus for Experimental Research</title>
which throws the following error:
org.springframework.dao.DataAccessResourceFailureException: Error reading XML stream; nested exception is javax.xml.stream.XMLStreamException: ParseError at [row,col]:[127,25]
Message: The entity name must immediately follow the '&' in the entity reference.
Is there anything I'm doing wrong in my configuration that prevents the rest of the records from being processed?
To skip certain types of exceptions you can either provide a skip policy, where you write custom logic for deciding which exceptions to skip, like the code below:
@Bean
public Step stepSGR() throws IOException {
    return stepBuilderFactory.get("ETL_STEP").<SigmodRecord.Issue, SigmodRecord.Issue>chunk(1)
            //.processor(itemProcessor())
            .writer(itemWriter())
            .reader(multiReader())
            .faultTolerant()
            .skipPolicy(new FileVerificationSkipper())
            .build();
}

public class FileVerificationSkipper implements SkipPolicy {

    private static final Logger LOG = LoggerFactory.getLogger(FileVerificationSkipper.class);

    @Override
    public boolean shouldSkip(Throwable t, int skipCount) throws SkipLimitExceededException {
        LOG.error("There is an error {}", t);
        if (t instanceof DataAccessResourceFailureException) {
            return true;
        }
        return false;
    }
}
Or you can simply set it up like below:
@Bean
public Step stepSGR() throws IOException {
    return stepBuilderFactory.get("ETL_STEP").<SigmodRecord.Issue, SigmodRecord.Issue>chunk(1)
            //.processor(itemProcessor())
            .writer(itemWriter())
            .reader(multiReader())
            .faultTolerant()
            .skipLimit(Integer.MAX_VALUE)
            .skip(DataAccessResourceFailureException.class)
            .build();
}
This issue falls under malformed xml, and it seems there is no way to recover from it except fixing the xml itself. Spring's StaxEventItemReader uses an XMLEventReader for its low-level parsing of the xml, so I tried reading the xml file with an XMLEventReader directly to try to skip the bad block; however, XMLEventReader.nextEvent() kept throwing an exception at the bad block. I tried handling that in a try/catch block in order to skip to the next event, but it seems the reader won't move past it. So for now the only way to solve the issue is to fix the xml itself before processing it.
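One pragmatic way to "fix the xml first" is a pre-processing step that escapes bare ampersands before the files are read. The sketch below is only illustrative and assumes the files are small enough to load into memory and that stray & characters are the only problem; the class name, regex and file handling are mine, not from the original post.
// Hypothetical pre-processing tasklet (imports omitted, as elsewhere in this post):
// escapes '&' characters that are not already part of an entity reference, rewriting each file in place.
public class XmlSanitizingTasklet implements Tasklet {

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        ResourcePatternResolver resolver = new PathMatchingResourcePatternResolver();
        for (Resource resource : resolver.getResources("file:c:/temp/Sigm*.xml")) {
            Path path = resource.getFile().toPath();
            String content = new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
            // replace any '&' not followed by something that looks like an entity reference
            String sanitized = content.replaceAll("&(?![a-zA-Z]+;|#\\d+;|#x[0-9a-fA-F]+;)", "&amp;");
            Files.write(path, sanitized.getBytes(StandardCharsets.UTF_8));
        }
        return RepeatStatus.FINISHED;
    }
}
Running this as a tasklet step before ETL_STEP would let the existing StaxEventItemReader consume the corrected files.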

How to get Job parameteres in to item processor using spring Batch annotation

I am using spring MVC. From my controller, I am calling jobLauncher and in jobLauncher I am passing job parameters like below and I'm using annotations to enable configuration as below:
@Configuration
@EnableBatchProcessing
public class BatchConfiguration {
    // read, write, process and invoke job
}
JobParameters jobParameters = new JobParametersBuilder().addString("fileName", "xxxx.txt").toJobParameters();
startJob = jobLauncher.run(job, jobParameters);
and here is my itemprocessor
public class DataItemProcessor implements ItemProcessor<InputData, OutPutData> {
public OutPutData process(final InputData inputData) throws Exception {
// i want to get job Parameters here ????
}
}
1) Put a scope annotation on your data processor, i.e.
@Scope(value = "step")
2) Declare an instance field in your data processor and inject the job parameter value into it using the @Value annotation:
@Value("#{jobParameters['fileName']}")
private String fileName;
Your final data processor class will look like:
@Scope(value = "step")
public class DataItemProcessor implements ItemProcessor<InputData, OutPutData> {

    @Value("#{jobParameters['fileName']}")
    private String fileName;

    public OutPutData process(final InputData inputData) throws Exception {
        // the job parameter is now available here
        System.out.println("Job parameter:" + fileName);
        // mapping of inputData to OutPutData omitted
        return null;
    }

    public void setFileName(String fileName) {
        this.fileName = fileName;
    }
}
In case your data processor is not initialized as a bean, put a @Component annotation on it:
@Component("dataItemProcessor")
@Scope(value = "step")
public class DataItemProcessor implements ItemProcessor<InputData, OutPutData> {
A better solution (in my opinion) that avoids using Spring's hacky expression language (SpEL) is to autowire the StepExecution context into your processor using @BeforeStep.
In your processor, add something like:
@BeforeStep
public void beforeStep(final StepExecution stepExecution) {
    JobParameters jobParameters = stepExecution.getJobParameters();
    // Do stuff with job parameters, e.g. set class-scoped variables, etc.
}
The @BeforeStep annotation
Marks a method to be called before a Step is executed, which comes
after a StepExecution is created and persisted, but before the first
item is read.
I have written the processor inline, rather than creating a separate file, using a lambda expression.
@Bean
@StepScope
public ItemProcessor<SampleTable, SampleTable> processor(@Value("#{jobParameters['eventName']}") String eventName) {
    //return new RandomNumberProcessor();
    return item -> {
        SampleTable dataSample = new SampleTable();
        if (data.contains(item)) { // 'data' is a collection defined elsewhere in this configuration
            return null;
        } else {
            dataSample.setMobileNo(item.getMobileNo());
            dataSample.setEventId(eventName);
            return dataSample;
        }
    };
}
