Spring Batch memory leak - CSV to database using JpaItemWriter - java

I had a problem with a Spring Batch job that reads a large CSV file (a few million records) and saves its records to a database. The job uses a FlatFileItemReader for reading the CSV and a JpaItemWriter for writing the read and processed records to the database. The problem is that JpaItemWriter doesn't clear the persistence context after flushing each chunk of items to the database, so the job eventually ends with an OutOfMemoryError.
I have solved the problem by extending JpaItemWriter and overriding the write method so that it calls EntityManager.clear() after writing each chunk, but I was wondering whether Spring Batch already addresses this, or whether the root of the problem is in my job config. What is the right way to address this issue?
My solution:
class ClearingJpaItemWriter<T> extends JpaItemWriter<T> {

    private EntityManagerFactory entityManagerFactory;

    @Override
    public void write(List<? extends T> items) {
        super.write(items);
        EntityManager entityManager = EntityManagerFactoryUtils.getTransactionalEntityManager(entityManagerFactory);
        if (entityManager == null) {
            throw new DataAccessResourceFailureException("Unable to obtain a transactional EntityManager");
        }
        entityManager.clear();
    }

    @Override
    public void setEntityManagerFactory(EntityManagerFactory entityManagerFactory) {
        super.setEntityManagerFactory(entityManagerFactory);
        this.entityManagerFactory = entityManagerFactory;
    }
}
You can see the added entityManager.clear() call in the write method.
Job config:
@Bean
public JpaItemWriter<Appointment> postgresWriter() {
    JpaItemWriter<Appointment> writer = new ClearingJpaItemWriter<>();
    writer.setEntityManagerFactory(pgEntityManagerFactory);
    return writer;
}

@Bean
public Step appointmentInitStep(JpaItemWriter<Appointment> writer, FlatFileItemReader<Appointment> reader) {
    return stepBuilderFactory.get("initEclinicAppointments")
            .transactionManager(platformTransactionManager)
            .<Appointment, Appointment>chunk(5000)
            .reader(reader)
            .writer(writer)
            .faultTolerant()
            .skipLimit(1000)
            .skip(FlatFileParseException.class)
            .build();
}

@Bean
public Job appointmentInitJob(@Qualifier("initEclinicAppointments") Step step) {
    return jobBuilderFactory.get(JOB_NAME)
            .incrementer(new RunIdIncrementer())
            .preventRestart()
            .start(step)
            .build();
}

That's a valid point. The JpaItemWriter (and HibernateItemWriter) used to clear the persistence context, but this was removed in BATCH-1635 (here is the commit that removed it). However, the behaviour was re-added and made configurable in the HibernateItemWriter in BATCH-1759 through the clearSession parameter (see this commit), but not in the JpaItemWriter.
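For reference, this is roughly what that configurable behaviour looks like on the Hibernate side (a minimal sketch; the bean and entity names are illustrative):

// Minimal sketch: HibernateItemWriter exposes a clearSession flag (BATCH-1759)
// that clears the Hibernate session after each chunk has been written and flushed.
@Bean
public HibernateItemWriter<Appointment> hibernateWriter(SessionFactory sessionFactory) {
    HibernateItemWriter<Appointment> writer = new HibernateItemWriter<>();
    writer.setSessionFactory(sessionFactory);
    writer.setClearSession(true); // clear the session after each flushed chunk
    return writer;
}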
So I suggest opening an issue against Spring Batch to add the same option to the JpaItemWriter, so that it can clear the persistence context after writing items (this would be consistent with the HibernateItemWriter).
That said, to answer your question, you can indeed use a custom writer to clear the persistence context as you did.
Hope this helps.

Related

Spring batch reading files with MultiResourceItemReader and using ItemReadListener

Here's the scenario: I have a Spring Batch job that reads multiple input files, processes them, and finally generates output files.
Using a FlatFileItemReader and restarting the entire batch with a cron, I can process the files one by one; however, it is not feasible to restart the batch every X seconds just to process the files individually.
PS: I use an ItemReadListener to add some properties of the object being read to the jobExecutionContext, which are used later to validate (and generate, or not, the output file).
However, if I use a MultiResourceItemReader to read all the input files without completely restarting the whole context (and the resources), the ItemReadListener overwrites the properties of each object (input file) in the jobExecutionContext, so that only the data of the last object in the array of input files remains.
Is there any way to use the ItemReadListener for each Resource read inside a MultiResourceItemReader?
Example Reader:
@Bean
public MultiResourceItemReader<CustomObject> multiResourceItemReader() {
    MultiResourceItemReader<CustomObject> resourceItemReader = new MultiResourceItemReader<CustomObject>();
    resourceItemReader.setResources(resources);
    resourceItemReader.setDelegate(reader());
    return resourceItemReader;
}

@Bean
public FlatFileItemReader<CustomObject> reader() {
    FlatFileItemReader<CustomObject> reader = new FlatFileItemReader<CustomObject>();
    reader.setLineMapper(customObjectLineMapper());
    return reader;
}
Example Step:
@Bean
public Step loadInputFiles() {
    return stepBuilderFactory.get("loadInputFiles")
            .<CustomObject, CustomObject>chunk(10)
            .reader(multiResourceItemReader())
            .writer(new NoOpItemWriter())
            .listener(customObjectListener())
            .build();
}
Example Listener:
public class CustomObjectListener implements ItemReadListener<CustomObject> {

    @Value("#{jobExecution.executionContext}")
    private ExecutionContext executionContext;

    @Override
    public void beforeRead() {
    }

    @Override
    public void afterRead(CustomObject item) {
        executionContext.put("customProperty", item.getCustomProperty());
    }

    @Override
    public void onReadError(Exception ex) {
    }
}
Scheduler:
public class Scheduler {

    @Autowired
    JobLauncher jobLauncher;

    @Autowired
    Job job;

    SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    @Scheduled(fixedDelay = 5000, initialDelay = 5000)
    public void scheduleByFixedRate() throws Exception {
        JobParameters params = new JobParametersBuilder()
                .addString("time", format.format(Calendar.getInstance().getTime()))
                .toJobParameters();
        jobLauncher.run(job, params);
    }
}
Using a FlatFileItemReader and restarting the entire batch with a cron, I can process the files one by one; however, it is not feasible to restart the batch every X seconds just to process the files individually.
That is the very reason I always recommend the job-per-file approach over the single-job-for-all-files-with-MultiResourceItemReader approach, like here or here.
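For reference, a minimal sketch of the job-per-file idea (field and parameter names are illustrative): each file becomes its own job execution, identified by an "input.file" job parameter that the reader can bind to.

// Minimal sketch of a job-per-file launcher: one job execution per input file,
// identified by the file path ("input.file" is an illustrative parameter name).
@Scheduled(fixedDelay = 5000, initialDelay = 5000)
public void launchJobPerFile() throws Exception {
    for (Resource resource : resources) { // e.g. resolved via a ResourcePatternResolver
        JobParameters params = new JobParametersBuilder()
                .addString("input.file", resource.getFile().getAbsolutePath())
                .toJobParameters();
        jobLauncher.run(job, params);
    }
}

Since the file path is an identifying parameter, relaunching an already completed file is rejected by the job repository (JobInstanceAlreadyCompleteException), so in practice you would either catch that exception or move processed files out of the input directory.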
Is there any way to use the ItemReadListener for each Resource read inside a MultiResourceItemReader?
No, because the listener is not aware of the resource the item was read from. This is a limitation of the approach itself, not of Spring Batch. What you can do, though, is make your items aware of the resource they were read from by implementing ResourceAware.
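A minimal sketch of that, assuming CustomObject is under your control:

// Minimal sketch: items implementing org.springframework.batch.item.ResourceAware
// get the originating Resource injected by MultiResourceItemReader, so the listener
// can store its data per file instead of overwriting a single key.
public class CustomObject implements ResourceAware {

    private Resource resource;
    private String customProperty;

    @Override
    public void setResource(Resource resource) {
        this.resource = resource;
    }

    public Resource getResource() {
        return resource;
    }

    public String getCustomProperty() {
        return customProperty;
    }

    public void setCustomProperty(String customProperty) {
        this.customProperty = customProperty;
    }
}

In afterRead, the listener could then use something like item.getResource().getFilename() as part of the execution-context key, so each input file keeps its own properties.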

What is the alternative way of PostConstruct annotation?

First of all, thank you for visiting my post.
I have a Spring Batch project that uses a tasklet and a chunk step.
The chunk, aka step1, is where I process data and generate a new CSV file in the S3 bucket.
The tasklet, aka step2, is where I read the CSV file from the S3 bucket that was generated in step1 and publish an SNS topic.
Right now, I have a problem with the SNSTopicSender class that implements Tasklet.
First of all, here are the config class and the SNSTopicSender class.
@Bean
Step step1() {
    return this.stepBuilderFactory
        .get("step1")
        .<BadStudent, BadStudent>chunk(100)
        .reader(new IteratorItemReader<Student>(this.StudentLoader.Students.iterator()) as ItemReader<? extends BadStudent>)
        .processor(this.studentProcessor as ItemProcessor<? super BadStudent, ? extends BadStudent>)
        .writer(this.csvWriter())
        .build()
}

@Bean
Step step2() {
    return this.stepBuilderFactory
        .get("step2")
        .tasklet(new PublishSnsTopic())
        .build()
}

@Bean
Job job() {
    return this.jobBuilderFactory
        .get("scoring-students-batch")
        .incrementer(new RunIdIncrementer())
        .start(this.step1())
        .next(this.step2())
        .build()
}
@Configuration
@Service
class SNSTopicSender implements Tasklet {

    @Autowired
    ResourceLoader resourceLoader

    List<BadStudent> badStudents

    @Autowired
    FileProperties fileProperties

    @PostConstruct
    void setup() {
        String badStudentCSVFileName = "s3://students/failedStudents.csv"
        Reader badStudentReader = new InputStreamReader(
            this.resourceLoader.getResource(badStudentCSVFileName).inputStream
        )
        // create message body and call the publishTopic function
    }

    void publishTopic(SnsClient snsClient, String message, String arn) {
        // sending a topic
    }

    @Override
    RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        return RepeatStatus.FINISHED
    }
}
I used @PostConstruct because without it, resourceLoader and fileProperties would be null, since bean injection would not have happened yet, as you know.
To work around that, I used @PostConstruct.
However, I recently realized that the SNSTopicSender class is not reading the CSV file just created, but the CSV file that was already there before this batch job ran, because the @PostConstruct method fires before step1 (where the target CSV file is created) completes.
But if I remove the @PostConstruct, then resourceLoader and fileProperties are null, meaning the batch job does not know where the CSV file is stored or how to read it.
Can anyone help me with this please?
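Not an answer from the original thread, but to illustrate the timing involved: a tasklet's execute() method only runs when step2 starts, i.e. after step1 has written the file. A hedged sketch of one alternative is to move the read out of @PostConstruct and into execute(), and to let step2 use the tasklet as an injected Spring bean rather than creating it with new, so the @Autowired fields are populated:

// Hedged sketch (not from the thread): read the CSV inside execute(), which runs
// only when step2 starts, i.e. after step1 has produced the file on S3.
// The tasklet must be registered as a Spring bean and injected into step2
// (rather than created with `new`) so that the @Autowired fields are populated.
@Component
class SNSTopicSender implements Tasklet {

    @Autowired
    ResourceLoader resourceLoader;

    @Autowired
    FileProperties fileProperties;

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        String badStudentCSVFileName = "s3://students/failedStudents.csv";
        Reader badStudentReader = new InputStreamReader(
                resourceLoader.getResource(badStudentCSVFileName).getInputStream());
        // build the message body from the freshly written file and publish the SNS topic here
        return RepeatStatus.FINISHED;
    }
}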

Spring Batch Kafka to Database Job

I need a Spring Batch ItemReader to consume Kafka messages, whose records are to be processed and written further down the line.
Here's an item reader I have implemented:
public abstract class KafkaItemReader<T> implements ItemReader<List<T>> {

    public abstract KafkaConsumer<String, T> getKafkaConsumer();

    public abstract String getTopic();

    public abstract long getPollingTime();

    @Override
    public List<T> read() throws Exception, UnexpectedInputException, ParseException, NonTransientResourceException {
        Iterator<ConsumerRecord<String, T>> iterator = getKafkaConsumer()
                .poll(Duration.ofMillis(getPollingTime()))
                .records(getTopic())
                .iterator();

        List<T> records = new ArrayList<>();
        while (iterator.hasNext()) {
            records.add(iterator.next().value());
        }
        return records;
    }
}
These are the following beans for spring batch job and step:
@Bean
public ItemWriter<List<DbEntity>> databaseWriter(DataSource dataSource) {
    // some item writer that needs to be implemented
    return null;
}

@Bean
public Step kafkaToDatabaseStep(KafkaItemReader kafkaItemReader, // implementation of KafkaItemReader
                                StepBuilderFactory stepBuilderFactory,
                                DataSource dataSource) {
    return stepBuilderFactory
            .get("kafkaToDatabaseStep")
            .allowStartIfComplete(true)
            .<List<KafkaRecord>, List<DbEntity>>chunk(100)
            .reader(kafkaItemReader)
            .processor(itemProcessor()) // List<KafkaRecord> to List<DbEntity> converter
            .writer(databaseWriter(dataSource))
            .build();
}

@Bean
public Job kafkaToDatabaseJob(@Qualifier("kafkaToDatabaseStep") Step step) {
    return jobBuilderFactory.get("kafkaToDatabaseJob")
            .incrementer(new RunIdIncrementer())
            .flow(step)
            .end()
            .build();
}
Here is what I do not know:
How to commit the offset of the read messages in the writer, as I want to commit only after the records have been completely processed.
How to use JdbcBatchItemWriter as the ItemWriter in my scenario.
The upcoming Spring Batch v4.2 GA will provide support for reading/writing data to Apache Kafka topics. You can already try this out with the 4.2.0.M2 release.
You can also take a look at the Spring Tips installment about it by Josh Long.
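For illustration, here is a minimal sketch of the reader introduced in 4.2 (topic, group, and deserializer choices are assumptions, not from the question). Note that the built-in reader returns one record per read() call, so the chunk types would become <KafkaRecord, DbEntity> rather than lists:

// Minimal sketch of Spring Batch 4.2's KafkaItemReader (illustrative settings).
// With saveState(true), the reader tracks its offsets in the step execution context
// for restartability instead of relying on a custom reader implementation.
@Bean
public KafkaItemReader<String, KafkaRecord> kafkaItemReader() {
    Properties consumerProperties = new Properties();
    consumerProperties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    consumerProperties.put(ConsumerConfig.GROUP_ID_CONFIG, "batch-consumer-group");
    consumerProperties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    consumerProperties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, JsonDeserializer.class.getName()); // e.g. spring-kafka's JsonDeserializer

    return new KafkaItemReaderBuilder<String, KafkaRecord>()
            .name("kafkaItemReader")
            .consumerProperties(consumerProperties)
            .topic("records-topic")
            .partitions(0)
            .saveState(true)
            .build();
}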

Spring Batch multiple inserts for one read

I have a Spring Batch process that reads Report objects from a CSV and inserts Analytic objects into a MySQL DB correctly, but the logic has changed so that more than one Analytic must be inserted for each Report read.
I'm new to Spring Batch, the existing process was quite difficult for me, and I don't know how to make this change.
I have no XML configuration; everything is done with annotations. The Report and Analytic classes have getters and setters for two fields, adId and value. The new logic has seven values for an adId, and I need to insert seven rows into the table.
I have hidden or removed some code that does not contribute to the question.
Here is my BatchConfiguration.java:
@Configuration
@EnableBatchProcessing
public class BatchConfiguration {

    @Autowired
    private transient JobBuilderFactory jobBuilderFactory;

    @Autowired
    private transient StepBuilderFactory stepBuilderFactory;

    @Autowired
    private transient DataSource dataSource;

    public FlatFileItemReader<Report> reader() {
        // The reader from the CSV works fine.
    }

    @Bean
    public JdbcBatchItemWriter<Analytic> writer() {
        final JdbcBatchItemWriter<Analytic> writer = new JdbcBatchItemWriter<Analytic>();
        writer.setItemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider<Analytic>());
        writer.setSql("INSERT INTO TABLE (ad_id, value) VALUES (:adId, :value)");
        writer.setDataSource(dataSource);
        return writer;
    }

    @Bean
    public AnalyticItemProcessor processor() {
        return new AnalyticItemProcessor();
    }

    @Bean
    public Step step() {
        return stepBuilderFactory.get("step1")
                .<Report, Analytic>chunk(10000)
                .reader(reader())
                .processor(processor())
                .writer(writer())
                .build();
    }

    @Bean
    public Job process() {
        final JobBuilder jobBuilder = jobBuilderFactory.get("process");
        return jobBuilder.start(step()).build();
    }
}
Then the AnalyticItemProcessor.java
public class AnalyticItemProcessor implements ItemProcessor<Report, Analytic> {

    @Override
    public Analytic process(final Report report) {
        // Creates a new Analytic, calls BeanUtils.copyProperties(report, analytic) and returns the analytic.
    }
}
And the Process:
@SpringBootApplication
public class Process {

    public static void main(String[] args) throws Exception {
        SpringApplication.run(Process.class, args);
    }
}
How can I make this change? Maybe with an ItemPreparedStatementSetter or an ItemSqlParameterSourceProvider? Thanks.
If I'm understanding your question correctly, you can use the CompositeItemWriter to wrap multiple JdbcBatchItemWriter instances (one per insert you need to accomplish). That would allow you to insert multiple rows per item. Otherwise, you'd need to write your own ItemWriter implementation.
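If the "own ItemWriter implementation" route is preferred, for example because all seven rows come from a single Analytic, a hedged sketch could look like the following. Analytic#getValues() is a hypothetical accessor for the seven values, since the question only mentions adId and value:

// Hedged sketch of a custom ItemWriter that inserts several rows per item.
// getValues() is hypothetical; "TABLE" is kept as the question's placeholder table name.
public class MultiRowAnalyticWriter implements ItemWriter<Analytic> {

    private final JdbcTemplate jdbcTemplate;

    public MultiRowAnalyticWriter(DataSource dataSource) {
        this.jdbcTemplate = new JdbcTemplate(dataSource);
    }

    @Override
    public void write(List<? extends Analytic> items) {
        List<Object[]> rows = new ArrayList<>();
        for (Analytic analytic : items) {
            // hypothetical: getValues() returns the seven values belonging to this adId
            for (String value : analytic.getValues()) {
                rows.add(new Object[] { analytic.getAdId(), value });
            }
        }
        // one JDBC batch per chunk, seven rows per Analytic
        jdbcTemplate.batchUpdate("INSERT INTO TABLE (ad_id, value) VALUES (?, ?)", rows);
    }
}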

Multiple itemwriters in Spring batch

I am currently writing a Spring Batch job where I read a chunk of data, process it, and then wish to pass this data to two writers. One writer would simply update the database, whereas the second writer would write to a CSV file.
I am planning to write my own custom writer, inject the two ItemWriters into the customItemWriter, and call the write methods of both item writers in the write method of the customItemWriter. Is this approach correct? Are there any ItemWriter implementations available that meet my requirements?
Thanks in advance
You can use Spring's CompositeItemWriter and register all your writers as its delegates.
Here is a configuration example.
You don't necessarily have to use XML like the example. If the rest of your code uses annotations, you could simply do the following.
public ItemWriter<T> writerOne() {
    // T stands for your item type; ItemWriter is an interface, so use a lambda
    // or a concrete implementation (JdbcBatchItemWriter, FlatFileItemWriter, ...)
    ItemWriter<T> writer = items -> {
        // your logic here
    };
    return writer;
}

public ItemWriter<T> writerTwo() {
    ItemWriter<T> writer = items -> {
        // your logic here
    };
    return writer;
}

public CompositeItemWriter<T> compositeItemWriter() {
    CompositeItemWriter<T> writer = new CompositeItemWriter<>();
    writer.setDelegates(Arrays.asList(writerOne(), writerTwo()));
    return writer;
}
You were right. Spring Batch is heavily based on delegation, so using a CompositeItemWriter is the right choice for your needs.
Java config way (Spring Batch 4):
@Bean
public Step step1() {
    return this.stepBuilderFactory.get("step1")
            .<String, String>chunk(2)
            .reader(itemReader())
            .writer(compositeItemWriter())
            .stream(fileItemWriter1())
            .stream(fileItemWriter2())
            .build();
}

/**
 * In Spring Batch 4, the CompositeItemWriter implements ItemStream so this isn't
 * necessary, but used for an example.
 */
@Bean
public CompositeItemWriter compositeItemWriter() {
    List<ItemWriter> writers = new ArrayList<>(2);
    writers.add(fileItemWriter1());
    writers.add(fileItemWriter2());

    CompositeItemWriter itemWriter = new CompositeItemWriter();
    itemWriter.setDelegates(writers);
    return itemWriter;
}
Depending on your needs, another option is to extend the writer class and add functionality there. For example, I have a project where I extend HibernateItemWriter and override write(List items). I then pass the objects I am writing, along with my sessionFactory, to the writer's doWrite method: doWrite(sessionFactory, filteredRecords).
So, in the example above, I could write to the CSV file in my extended class and then let HibernateItemWriter write to the database. Obviously this might not be ideal for this example, but for certain scenarios it is a nice option.
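A hedged sketch of that extension approach (class and method names other than HibernateItemWriter#write are illustrative, not the answerer's exact code):

// Hedged sketch: write the chunk to a CSV file first, then let HibernateItemWriter
// persist it. Calling super.write() keeps the standard save-and-flush behaviour.
public class CsvAndHibernateItemWriter<T> extends HibernateItemWriter<T> {

    @Override
    public void write(List<? extends T> items) {
        writeToCsv(items);   // custom side effect, e.g. append the chunk to a CSV file
        super.write(items);  // standard Hibernate write: save each item and flush the session
    }

    private void writeToCsv(List<? extends T> items) {
        // hypothetical: format the items and append them to the CSV output
    }
}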
Here's a possible solution: two writers inside a composite writer.
@Bean
public JdbcBatchItemWriter<XPTO> writer(DataSource dataSource) {
    return new JdbcBatchItemWriterBuilder<XPTO>()
            .itemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider<>())
            .sql("UPDATE xxxx")
            .dataSource(dataSource)
            .build();
}

@Bean
public JdbcBatchItemWriter<XPTO> writer2(DataSource dataSource) {
    return new JdbcBatchItemWriterBuilder<XPTO>()
            .itemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider<>())
            .sql("UPDATE yyyyy")
            .dataSource(dataSource)
            .build();
}

@Bean
public CompositeItemWriter<XPTO> compositeItemWriter(DataSource dataSource) {
    CompositeItemWriter<XPTO> compositeItemWriter = new CompositeItemWriter<>();
    compositeItemWriter.setDelegates(Arrays.asList(writer(dataSource), writer2(dataSource)));
    return compositeItemWriter;
}

@Bean
protected Step step1(DataSource datasource) {
    return this.stepBuilderFactory.get("step1")
            .<XPTO, XPTO>chunk(1)
            .reader(reader())
            .processor(processor())
            .writer(compositeItemWriter(datasource))
            .build();
}
