I've a Spring Batch process that read Report objects from a CSV and insert Analytic objects into a MySQL DB correctly, but the logical has changed for a more than one Analytics insert for each Report readed.
I'm new in Spring Batch and the actually process was very difficult for me, and I don't know how to do this change.
I haven't XML configuration, all is with annotations. Report and Analytics classes have a getter and a setter for two fields, adId and value. The new logic has seven values for an adId and I need to insert seven rows into table.
I hide, delete or supress some code that not contribute for the question.
Here is my BatchConfiguration.java:
#Configuration
#EnableBatchProcessingpublic
class BatchConfiguration {
#Autowired
private transient JobBuilderFactory jobBuilderFactory;
#Autowired
private transient StepBuilderFactory stepBuilderFactory;
#Autowired
private transient DataSource dataSource;
public FlatFileItemReader<Report> reader() {
// The reader from the CSV works fine.
}
#Bean
public JdbcBatchItemWriter<Analytic> writer() {
final JdbcBatchItemWriter<Analytic> writer = new JdbcBatchItemWriter<Analytic>();
writer.setItemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider<Analytic>());
writer.setSql("INSERT INTO TABLE (ad_id, value) VALUES (:adId, :value)");
writer.setDataSource(dataSource);
return writer;
}
#Bean
public AnalyticItemProcessor processor() {
return new AnalyticItemProcessor();
}
#Bean
public Step step() {
return stepBuilderFactory.get("step1").<Report, Analytic> chunk(10000).reader(reader()).processor(processor()).writer(writer()).build();
}
#Bean
public Job process() {
final JobBuilder jobBuilder = jobBuilderFactory.get("process");
return jobBuilder.start(step()).build();
}
}
Then the AnalyticItemProcessor.java
public class AnalyticItemProcessor implements ItemProcessor<Report, Analytic> {
#Override
public Analytic process(final Report report) {
// Creates a new Analytic call BeanUtils.copyProperties(report, analytic) and returns analytic.
}
}
And the Process:
#SpringBootApplication
public class Process {
public static void main(String[] args) throws Exception {
SpringApplication.run(Process.class, args);
}
}
How can I do this change? Maybe with ItemPreparedStatementSetter or ItemSqlParameterSourceProvider? Thanks.
If I'm understanding your question correctly, you can use the CompositeItemWriter to wrap multiple JdbcBatchItemWriter instances (one per insert you need to accomplish). That would allow you to insert multiple rows per item. Otherwise, you'd need to write your own ItemWriter implementation.
Related
I have a simple job with only one step, but in some way the Batch loops from reader to processor and then to reader again. I can't understand why.
This is the structure:
The reader makes a double select on the same database. The first select needs to search in the first table some records in some state and the second select needs to match those results, get some records from the second table and send them to processor that call an api for every record.
I need to stop the batch running at this point, so after the processor. But I have some problems with this.
Example of my batch:
#Configuration
#EnableBatchProcessing
#EnableScheduling
public class LoadIdemOperationJob {
#Autowired
public JobBuilderFactory jobBuilderFactory;
#Autowired
public StepBuilderFactory stepBuilderFactory;
#Autowired
public JobLauncher jobLauncher;
#Autowired
public JobRegistry jobRegistry;
#Scheduled(cron = "* */3 * * * *")
public void perform() throws Exception {
JobParameters jobParameters = new JobParametersBuilder()
.addString("JobID", String.valueOf(System.currentTimeMillis()))
.toJobParameters();
jobLauncher.run(jobRegistry.getJob("firstJob"), jobParameters);
}
#Bean
public Job firstJob(Step firstStep) {
return jobBuilderFactory.get("firstJob")
.start(firstStep)
.build();
}
#Bean
public Step firstStep(MyReader reader,
MyProcessor processor) {
return stepBuilderFactory.get("firstStep")
.<List<String>, List<String>>chunk(1)
.reader(reader)
.processor(processor)
.writer(new NoOpItemWriter())
.build();
}
#Bean
#StepScope
public MyReader reader(#Value("${hours}") String hours) {
return new MyReader(hours);
}
#Bean
public MyProcessor processor() {
return new MyProcessor();
}
public static class NoOpItemWriter implements ItemWriter<Object> {
#Override
public void write(#NonNull List<?> items) {
}
}
#Bean
public JobRegistryBeanPostProcessor jobRegistryBeanPostProcessor() {
JobRegistryBeanPostProcessor postProcessor = new JobRegistryBeanPostProcessor();
postProcessor.setJobRegistry(jobRegistry);
return postProcessor;
}
#Bean
public RequestContextListener requestContextListener() {
return new RequestContextListener();
}
}
Example of Reader:
public class MyReader implements ItemReader<List<String>> {
public String hours;
private List<String> results;
#Autowired
private JdbcTemplate jdbcTemplate;
public MyReader(String hours) {
this.hours = hours;
}
#Override
public List<String> read() throws Exception {
results = this.jdbcTemplate.queryForList(// 1^ query, String.class);
if (results.isEmpty()) {
return null;
}
List<String> results = this.jdbcTemplate.queryForList(// 2^ query, String.class);
if (results.isEmpty()) {
return null;
}
return results;
}
}
And Processor:
public class MyProcessor implements ItemProcessor<List<String>, List<String>> {
#Override
public List<String> process(#NonNull List<String> results) throws Exception {
results.forEach(result -> // calling service);
return null;
}
}
Thanks for help!
What you are seeing is the implementation of the chunk-oriented processing model of Spring Batch, where items are read and processed in sequence one by one, and written in chunks.
That said, the design and configuration of your chunk-oriented step is not ideal: the reader returns a List of Strings (so an item in your case is the List itself, not an element from the list), the processor loops over the elements of each List (while it is not intended to do so), and finally there is no item writer (this is a sign that either you don't need a chunk-oriented step, or the step is not well designed).
I can recommend to modify your step design as follows:
The reader should return a single item and not a List. For example, by using the iterator of results and make the reader return iterator.next().
Remove the processor and move its code in the item writer. In fact, the item processor is optional in a chunk-oriented step
Create an item writer with the code of the item processor. Posting results to a REST endpoint is in fact a kind of write operation, so an item writer is definitely better suited than an item processor in this case.
With that design, you should see your chunk-oriented step reading and writing all items from your list without the impression that the job is "looping". This is actually the implementation of the pattern described above.
I am pretty new to spring technology. I am trying to build an ETL like app using spring batch with spring boot.
Able to run the basic job (read->process->write). Now, I want to read the arguments (like date, file name, type, etc) from a config file (later) or command line (can work with it now) and use them in my job.
Entry point:
// Imports
#SpringBootApplication
#EnableBatchProcessing
public class EtlSpringBatchApplication {
public static void main(String[] args) {
SpringApplication.run(EtlSpringBatchApplication.class, args);
}
}
My batch configuration
// BatchConfig.java
// Imports
#Autowired
public JobBuilderFactory jobBuilderFactory;
#Autowired
public StepBuilderFactory stepBuilderFactory;
#Autowired
public MyDao myDao;
#Bean
public Job job() {
return jobBuilderFactory
.get("job")
.incrementer(new RunIdIncrementer())
.listener(new Listener(myDao))
.flow(step1())
.end()
.build();
}
#Bean
public Step step1() {
return stepBuilderFactory.get("step1").<myModel, myModel>chunk(1000)
.reader(Reader.reader("my_file_20200520.txt"))
.processor(new Processor())
.writer(new Writer(myDao))
.build();
}
I have basic steps steps.
Reader.java has method to read flat file.
public static FlatFileItemReader<MyModel> reader(String path) {......}
Processor.java has process method defined. I added a #BeforeStep to fetch some details from DB required for processing.
public class Processor implements ItemProcessor<MyModel, MyModel> {
private static final Logger log = LoggerFactory.getLogger(Processor.class);
private Long id = null;
#BeforeStep
public void getId(StepExecution stepExecution) {
this.id = stepExecution.getJobExecution().getExecutionContext().getLong("Id");
}
#Override
public MyModel process(MyModel myModel) throws Exception {
}
}
Writer.java is implementing ItemWriter and write code.
Listener.java extends JobExecutionListenerSupport and has overridden methods afterJob and beforeJob.
Basically tried to use executioncontext here in beforeJob.
#Override
public void beforeJob(JobExecution jobExecution) {
log.info("Getting the id..");
this.id = myDao.getLatestId();
log.info("id retrieved is: " + this.id);
jobExecution.getExecutionContext().putLong("Id", this.id);
}
Now, what I am looking for is:
The reader should get the file name from job arguments. i.e. when run the job, I should be able to give some arguments, one of them is file path.
Later some methods (like get id, etc) require few more variables which can be passed as arguments to job i.e. run_date, type, etc.
In short I am looking for a way to,
Pass job arguments to my app (run_date, type, file path etc)
Use them in reader and other places (Listener, Writer)
Can someone provide me what addiitons I should do in my BatchConfig.java and other places, to read the job parameters (from command line or config file, whichever is easy)?
Both Spring Batch and Spring Boot reference documentation show how to pass parameters to a job:
Running Jobs from the Command Line
Running Spring Batch jobs from the Command Line
Moreover, Spring Batch docs explain in details and with code examples how to use those parameters in batch components (like reader, writer, etc):
Late Binding of Job and Step Attributes
You can read the value of the of the job parameters set from the config file inside the reader or other classes within the spring batch execution context. Below is a snippet for reference,
application.yml file can have the below config,
batch.configs.filePath: c:\test
You can add the filePath read from the config to your job parameters when you start the job. Snippet of the class,
// Job and Job Launcher related autowires..
#Value("${batch.configs.filePath}")
private String filePath;
// inside a method block,
JobParameters jobParameters = new JobParametersBuilder().addLong("JobID", System.currentTimeMillis())
.addString("filePath", filePath).toJobParameters();
try {
jobLauncher.run(batchJob, jobParameters);
} catch (Exception e) {
logger.error("Exception while running a batch job {}", e.getMessage());
}
One of the ways to access the Job Parameters is to implement StepExecutionListener to your reader Class to make use of its Overridden methods beforeStep and afterStep. Similar implementations can be performed to other classes as well,
public class Reader implements ItemReader<String>, StepExecutionListener {
private String filePath;
#Override
public void beforeStep(StepExecution stepExecution) {
try {
filePath = (String) stepExecution.getJobExecution().getExecutionContext()
.get("filePath");
} catch (Exception e) {
logger.error("Exception while performing read {}", e);
}
}
#Override
public String read() throws Exception {
// filePath value read from the job execution can be used inside read use case impl
}
#Override
public ExitStatus afterStep(StepExecution stepExecution) {
return ExitStatus.COMPLETED;
}
}
Below is the configuration of my spring batch job which takes records from DB, do some processing in item processor, updates status column and writes back to DB.
When I ran for 10k records, I could see its taking every record one by one and updating status in the same manner. Initially I was planning to use multithreading but it doesn't make any sense as my job runs once in a day with number of records ranging from 10 to 100k. ( Records are less than 5k in most of the days and a very few days in a year ( 5 to 10 days) it comes to 50k to 100k).
I don't want to add more cpus and getting charged by Kubernetes just for 10 days of an year. Now the problem is when I ran this job, it takes only 100 records that it runs every select query independently instead of taking 100 at a time. Also update is also one record at a time and it takes 10 mins to process 10k records which is really slow.
How can do a faster read, process and write? I can get rid of multithreading and have a bit more of CPU utilization once in a while. More information is given as comments in code.
#Configuration
#EnableBatchProcessing
public class BatchConfiguration extends DefaultBatchConfigurer{
public final static Logger logger = LoggerFactory.getLogger(BatchConfiguration.class);
#Autowired
JobBuilderFactory jobBuilderFactory;
#Autowired
StepBuilderFactory stepBuilderFactory;
#Autowired
MyRepository myRepository;
#Autowired
private EntityManagerFactory entityManagerFactory;
#Value("${chunk-size}")
private int chunkSize;
#Value("${max-threads}")
private int maxThreads;
private final DataSource dataSource;
/**
* #param dataSource
* Override to do not set datasource even if a datasource exist during intialization.
* Initialize will use a Map based JobRepository (instead of database) for Spring batch meta tables
*/
#Override
public void setDataSource(DataSource dataSource) {
}
#Override
public PlatformTransactionManager getTransactionManager() {
return jpaTransactionManager();
}
#Autowired
public BatchConfiguration(#Qualifier("dataSource") DataSource dataSource) {
this.dataSource = dataSource;
}
#Bean
public JpaTransactionManager jpaTransactionManager() {
final JpaTransactionManager transactionManager = new JpaTransactionManager();
transactionManager.setDataSource(dataSource);
return transactionManager;
}
#Bean
#StepScope
public JdbcPagingItemReader<ModelEntity> importReader() { // I tried using RepositoryItemReader but records were skipped by JPA hence I went for JdbcPagingItemReader
JdbcPagingItemReader<ModelEntity> reader = new JdbcPagingItemReader<ModelEntity>();
final SqlPagingQueryProviderFactoryBean sqlPagingQueryProviderFactoryBean = new SqlPagingQueryProviderFactoryBean();
sqlPagingQueryProviderFactoryBean.setDataSource( dataSource );
sqlPagingQueryProviderFactoryBean.setSelectClause( "SELECT *" );
sqlPagingQueryProviderFactoryBean.setFromClause( "FROM mytable" );
sqlPagingQueryProviderFactoryBean.setWhereClause( "WHERE STATUS = 'myvalue' " );
sqlPagingQueryProviderFactoryBean.setSortKey( "primarykey" );
try {
reader.setQueryProvider( sqlPagingQueryProviderFactoryBean.getObject() );
} catch (Exception e) {
e.printStackTrace();
}
reader.setDataSource( dataSource );
reader.setPageSize( chunkSize );
reader.setSaveState( Boolean.FALSE );
reader.setRowMapper( new BeanPropertyRowMapper<ModelEntity>(ModelEntity.class ) );
return reader;
}
#Bean
public ItemWriter<ModelEntity> databaseWriter() {
RepositoryItemWriter<ModelEntity> repositoryItemWriter=new RepositoryItemWriter<>();
repositoryItemWriter.setRepository(myRepository);
repositoryItemWriter.setMethodName("save");
return repositoryItemWriter;
}
#Bean
public Myprocessor myprocessor() {
return new Myprocessor();
}
#Bean
public JobExecutionListener jobExecutionListener() {
return new JobExecutionListener();
}
#Bean
public StepExecutionListener stepExecutionListener() {
return new StepExecutionListener();
}
#Bean
public ChunkExecutionListener chunkListener() {
return new ChunkExecutionListener();
}
#Bean
public TaskExecutor taskExecutor() {
SimpleAsyncTaskExecutor taskExecutor = new SimpleAsyncTaskExecutor();
taskExecutor.setConcurrencyLimit(maxThreads);
return taskExecutor;
}
#Bean
public Job processJob() {
return jobBuilderFactory.get("myjob")
.incrementer(new RunIdIncrementer())
.start(processStep())
.listener(jobExecutionListener())
.build();
}
#Bean
public Step processStep() {
return stepBuilderFactory.get("processStep")
.<ModelEntity,ModelEntity>chunk(chunkSize)
.reader(importReader())
.processor(myprocessor())
.writer(databaseWriter())
.taskExecutor(taskExecutor())
.listener(stepExecutionListener())
.listener(chunkListener())
.transactionManager(getTransactionManager())
.throttleLimit(maxThreads)
.build();
}
}
Repository that I am using is JpaRepository and code below. (Assuming save method of its parent class CrudRepository will do save)
public interface MyRepository extends JpaRepository<ModelEntity, BigInteger> {
}
Processor is as below
#Component
public class Myprocessor implements ItemProcessor<Myprocessor,Myprocessor> {
#Override
public ModelEntity process(ModelEntity modelEntity) throws Exception {
try {
// This is fast and working fine
if ((myProcessing)) {
modelEntity.setStatus(success);
} else {
modelEntity.setStatus(failed);
}
}
catch (Exception e){
logger.info( "Exception occurred while processing"+e );
}
return modelEntity;
}
// This is fast and working fine
public Boolean myProcessing(ModelEntity modelEntity){
//Processor Logic Here
return processingStatus;
}
}
Properties file below
logging.level.org.hibernate.SQL=DEBUG
logging.level.com.zaxxer.hikari.HikariConfig=DEBUG
logging.level.org.hibernate.type.descriptor.sql.BasicBinder=TRACE
logging.level.org.springframework.jdbc.core.JdbcTemplate=DEBUG
logging.level.org.springframework.jdbc.core.StatementCreatorUtils=TRACE
spring.datasource.type=com.zaxxer.hikari.HikariDataSource
spring.datasource.url=url
spring.datasource.username=username
spring.datasource.password=password
spring.jpa.hibernate.connection.provider_class
=org.hibernate.hikaricp.internal.HikariCPConnectionProvider
spring.jpa.database-platform=org.hibernate.dialect.Oracle10gDialect
spring.jpa.show-sql=false
spring.main.allow-bean-definition-overriding=true
spring.batch.initializer.enabled=false
spring.batch.job.enabled=false
spring.batch.initialize-schema=never
chunk-size=100
max-threads=5
You can enable JDBC batch processing for INSERT, UPDATE and DELETE statements with just one configuration property:
spring.jpa.properties.hibernate.jdbc.batch_size
It determines the number of updates that are sent to the database at one time for execution.
For details, see this link
Thank you all for the suggestions. I found the issue myself. I was using JdbcPagingItemReader and RepositoryItemWriter. The reader was working as expected, but the writer was triggering a select query for each record passed after processor. I believe reason behind is that the the record is persistent to JPA only after processor since the reader is not a standard JPA reader. I am not sure about it though. But changing the writer to JdbcBatchItemWriter fixed the issue.
I want to start the same job configuration multiple times with different filename parameters, to import multiple files concurrently (each job one file).
#Configuration
public class MyJob {
//the steps contain each a reader, processor and writer
//reader is #JobScope
#Bean
public Job job(Step someStep, Step, someMoreStep) {
System.out.println("registering job bean");
return jobBuilderFactory.get(name)
.start(someStep)
.next(someMoreStep)
.build();
}
}
#Component
public class MyImporter {
#Autowired
private JobRegistry jobRegistry;
#Autowired
private JobLauncher launcher;
private static final String FILENAME = "baseFilename";
#Async
public void run(String i) {
p = new JobParametersBuilder();
p.addDate("date", new Date());
p.addString("filename", FILENAME + i + ".csv"); //injected via #Value jobParameter to job
Job job = jobRegistry.getJob("getMyJob");
launcher.run(job, p.toJobParameters());
}
}
#Component
public class MyImportManager {
#Autowired
private MyImporter importer;
//starts a job multiple times with different "filename" parameters,
//to simulate concurrent file imports with the same configuration job class
public void start() {
for (int i = 0; i < 4; i++) {
importer.run(i);
}
}
}
#SpringBootApplication
#EnableBatchProcessing(modular = true)
#EnableAsync
public class MyConfig {
}
Problem: I can start multiple jobs, but they seem to use eg the same reader. If I run a single job i < 1, everything is working fine. If I increase i, I'm getting weired inputs.
I'm using FlatFileItemReader where each value is read line by line. But when using concurrent imports (from different files), the input strings are often corrupt.
So I assume I'm not registering the jobs properly for concurrent imports? But why?
What is interesting: the line "registering job bean" is only printed once. But shouldn't it be printed as often as the async job is started?
FlatFileItemReader is not thread-safe, so your reader bean will need to be in "job" scope. Then you'll get one instance per job.
#Bean
#JobScope
public FlatFileItemReader<?> yourReaderBean(
#Value("#{jobParameters[filename]}") String filename){
FlatFileItemReader<?> itemReader = new FlatFileItemReader<?>();
itemReader.setLineMapper(lineMapper());
itemReader.setResource(new ClassPathResource(filename));
return itemReader;
}
Introduction
I am trying to use jobparameters created in a tasklet to create steps following the execution of the tasklet.
A tasklet tries to finds some files (findFiles()) and if it finds some files it saves the filenames to a list of strings.
In the tasklet I pass the data as following:
chunkContext.getStepContext().getStepExecution().getExecutionContext().put("files", fileNames);
The next step is a parallel flow where for each file a simple reader-processor-writer step will be executed (if you are interested in how I got there please see my previous question: Spring Batch - Looping a reader/processor/writer step)
Upon building the job readFilesJob() a flow is created initially using a "fake" list of files because only after the tasklet has been executed the real list of files is known.
Question
How do I configure my job so the tasklet gets executed first and then the parallel flow gets executed using the list of files generated from the tasklet?
I think it comes down to getting the list of filenames loaded with the correct data at the correct moment during runtime... but how?
Reproduce
Here is my simplified configuration:
#Configuration
#EnableBatchProcessing
public class BatchConfiguration {
private static final String FLOW_NAME = "flow1";
private static final String PLACE_HOLDER = "empty";
#Autowired
public JobBuilderFactory jobBuilderFactory;
#Autowired
public StepBuilderFactory stepBuilderFactory;
public List<String> files = Arrays.asList(PLACE_HOLDER);
#Bean
public Job readFilesJob() throws Exception {
List<Step> steps = files.stream().map(file -> createStep(file)).collect(Collectors.toList());
FlowBuilder<Flow> flowBuilder = new FlowBuilder<>(FLOW_NAME);
Flow flow = flowBuilder
.start(findFiles())
.next(createParallelFlow(steps))
.build();
return jobBuilderFactory.get("readFilesJob")
.start(flow)
.end()
.build();
}
private static Flow createParallelFlow(List<Step> steps){
SimpleAsyncTaskExecutor taskExecutor = new SimpleAsyncTaskExecutor();
taskExecutor.setConcurrencyLimit(steps.size());
List<Flow> flows = steps.stream()
.map(step ->
new FlowBuilder<Flow>("flow_" + step.getName())
.start(step)
.build())
.collect(Collectors.toList());
return new FlowBuilder<SimpleFlow>("parallelStepsFlow").split(taskExecutor)
.add(flows.toArray(new Flow[flows.size()]))
.build();
}
private Step createStep(String fileName){
return stepBuilderFactory.get("readFile" + fileName)
.chunk(100)
.reader(reader(fileName))
.writer(writer(filename))
.build();
}
private FileFinder findFiles(){
return new FileFinder();
}
}
Research
The question and answer from How to safely pass params from Tasklet to step when running parallel jobs suggest the usage of a construct like this in the reader/writer:
#Value("#{jobExecutionContext[filePath]}") String filePath
However, I really hope it is possible to pass the fileName as a string to the reader/writer due to the way the steps are created in the createParallelFlow() method. Therefore, even tho the answer to that question might be a solution for my problem here, it is not the desired solution. But please do not refrain from correcting me if I am wrong.
Closing
I am using the file names example to clarify the problem better. My problem is not actually the reading of multiple files from a directory. My question really boils down to the idea of generating data during runtime and passing it to the next dynamically generated step(s).
EDIT:
Added a simplified tasklet of the fileFinder.
#Component
public class FileFinder implements Tasklet, InitializingBean {
List<String> fileNames;
public List<String> getFileNames() {
return fileNames;
}
#PostConstruct
public void afterPropertiesSet() {
// read the filenames and store dem in the list
fileNames.add("sample-data1.csv");
fileNames.add("sample-data2.csv");
}
#Override
public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
// Execution of methods that will find the file names and put them in the list...
chunkContext.getStepContext().getStepExecution().getExecutionContext().put("files", fileNames);
return RepeatStatus.FINISHED;
}
}
I'm not sure, if I did understand your problem correctly, but as far as I see, you need to have the list with the filenames before you build your job dynamically.
You could do it like this:
#Component
public class MyJobSetup {
List<String> fileNames;
public List<String> getFileNames() {
return fileNames;
}
#PostConstruct
public void afterPropertiesSet() {
// read the filenames and store dem in the list
fileNames = ....;
}
}
After that, you can inject this Bean inside your JobConfiguration Bean
#Configuration
#EnableBatchProcessing
#Import(MyJobSetup.class)
public class BatchConfiguration {
private static final String FLOW_NAME = "flow1";
private static final String PLACE_HOLDER = "empty";
#Autowired
private MyJobSetup jobSetup; // <--- Inject
// PostConstruct of MyJobSetup was executed, when it is injected
#Autowired
public JobBuilderFactory jobBuilderFactory;
#Autowired
public StepBuilderFactory stepBuilderFactory;
public List<String> files = Arrays.asList(PLACE_HOLDER);
#Bean
public Job readFilesJob() throws Exception {
List<Step> steps = jobSetUp.getFileNames() // get the list of files
.stream() // as stream
.map(file -> createStep(file)) // map...
.collect(Collectors.toList()); // and create the list of steps