Spring Batch Kafka to Database Job

I need a Spring Batch ItemReader that consumes Kafka messages so that they can be processed and written further downstream.
Here's an item reader I have implemented:
public abstract class KafkaItemReader<T> implements ItemReader<List<T>> {
    public abstract KafkaConsumer<String, T> getKafkaConsumer();
    public abstract String getTopic();
    public abstract long getPollingTime();
    @Override
    public List<T> read() throws Exception, UnexpectedInputException, ParseException, NonTransientResourceException {
        Iterator<ConsumerRecord<String, T>> iterator = getKafkaConsumer()
                .poll(Duration.ofMillis(getPollingTime()))
                .records(getTopic())
                .iterator();
        List<T> records = new ArrayList<>();
        while (iterator.hasNext()) {
            records.add(iterator.next().value());
        }
        return records;
    }
}
These are the beans for the Spring Batch job and step:
@Bean
public ItemWriter<List<DbEntity>> databaseWriter(DataSource dataSource) {
    // some item writer that needs to be implemented
    return null;
}
@Bean
public Step kafkaToDatabaseStep(KafkaItemReader<KafkaRecord> kafkaItemReader, // implementation of KafkaItemReader
                                StepBuilderFactory stepBuilderFactory,
                                DataSource dataSource) {
    return stepBuilderFactory
            .get("kafkaToDatabaseStep")
            .allowStartIfComplete(true)
            .<List<KafkaRecord>, List<DbEntity>>chunk(100)
            .reader(kafkaItemReader)
            .processor(itemProcessor()) // List<KafkaRecord> to List<DbEntity> converter
            .writer(databaseWriter(dataSource))
            .build();
}
@Bean
public Job kafkaToDatabaseJob(
        @Qualifier("kafkaToDatabaseStep") Step step) {
    return jobBuilderFactory.get("kafkaToDatabaseJob")
            .incrementer(new RunIdIncrementer())
            .flow(step)
            .end()
            .build();
}
What I do not know:
How to commit the offsets of the messages read from within the writer, since I want to commit only after a record has been processed completely.
How to use JdbcBatchItemWriter as the ItemWriter in my scenario.

The upcoming Spring Batch v4.2 GA will provide support for reading/writing data to Apache Kafka topics. You can already try this out with the 4.2.0.M2 release.
You can also take a look at the Spring Tips installment about it by Josh Long.

Related

Loop Spring Batch

I have a simple job with only one step, but somehow the batch loops from the reader to the processor and then back to the reader again. I can't understand why.
This is the structure:
The reader runs two selects on the same database. The first select looks up records in a certain state in the first table; the second select matches those results against the second table and sends the matching records to the processor, which calls an API for every record.
I need the batch to stop running at this point, after the processor, but I am having trouble with this.
Example of my batch:
@Configuration
@EnableBatchProcessing
@EnableScheduling
public class LoadIdemOperationJob {
    @Autowired
    public JobBuilderFactory jobBuilderFactory;
    @Autowired
    public StepBuilderFactory stepBuilderFactory;
    @Autowired
    public JobLauncher jobLauncher;
    @Autowired
    public JobRegistry jobRegistry;
    @Scheduled(cron = "* */3 * * * *")
    public void perform() throws Exception {
        JobParameters jobParameters = new JobParametersBuilder()
                .addString("JobID", String.valueOf(System.currentTimeMillis()))
                .toJobParameters();
        jobLauncher.run(jobRegistry.getJob("firstJob"), jobParameters);
    }
    @Bean
    public Job firstJob(Step firstStep) {
        return jobBuilderFactory.get("firstJob")
                .start(firstStep)
                .build();
    }
    @Bean
    public Step firstStep(MyReader reader,
                          MyProcessor processor) {
        return stepBuilderFactory.get("firstStep")
                .<List<String>, List<String>>chunk(1)
                .reader(reader)
                .processor(processor)
                .writer(new NoOpItemWriter())
                .build();
    }
    @Bean
    @StepScope
    public MyReader reader(@Value("${hours}") String hours) {
        return new MyReader(hours);
    }
    @Bean
    public MyProcessor processor() {
        return new MyProcessor();
    }
    public static class NoOpItemWriter implements ItemWriter<Object> {
        @Override
        public void write(@NonNull List<?> items) {
        }
    }
    @Bean
    public JobRegistryBeanPostProcessor jobRegistryBeanPostProcessor() {
        JobRegistryBeanPostProcessor postProcessor = new JobRegistryBeanPostProcessor();
        postProcessor.setJobRegistry(jobRegistry);
        return postProcessor;
    }
    @Bean
    public RequestContextListener requestContextListener() {
        return new RequestContextListener();
    }
}
Example of Reader:
public class MyReader implements ItemReader<List<String>> {
    public String hours;
    private List<String> results;
    @Autowired
    private JdbcTemplate jdbcTemplate;
    public MyReader(String hours) {
        this.hours = hours;
    }
    @Override
    public List<String> read() throws Exception {
        results = this.jdbcTemplate.queryForList(/* 1st query */, String.class);
        if (results.isEmpty()) {
            return null;
        }
        results = this.jdbcTemplate.queryForList(/* 2nd query */, String.class);
        if (results.isEmpty()) {
            return null;
        }
        return results;
    }
}
And Processor:
public class MyProcessor implements ItemProcessor<List<String>, List<String>> {
    @Override
    public List<String> process(@NonNull List<String> results) throws Exception {
        results.forEach(result -> { /* calling service */ });
        return null;
    }
}
Thanks for help!

What you are seeing is the implementation of the chunk-oriented processing model of Spring Batch, where items are read and processed in sequence one by one, and written in chunks.
That said, the design and configuration of your chunk-oriented step is not ideal: the reader returns a List of Strings (so an item in your case is the List itself, not an element from the list), the processor loops over the elements of each List (while it is not intended to do so), and finally there is no item writer (this is a sign that either you don't need a chunk-oriented step, or the step is not well designed).
I recommend modifying your step design as follows:
The reader should return a single item, not a List. For example, keep an iterator over results and make the reader return iterator.next() on each call (and null once the iterator is exhausted).
Remove the processor and move its code into the item writer. In fact, the item processor is optional in a chunk-oriented step.
Create an item writer with the code of the item processor. Posting results to a REST endpoint is a kind of write operation, so an item writer is better suited than an item processor in this case.
With that design, you should see your chunk-oriented step reading and writing all items from your list without the impression that the job is "looping". This is actually the implementation of the pattern described above.
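To make the read-one-by-one, write-in-chunks pattern concrete, here is a minimal plain-Java sketch. It is a simplified model and not Spring Batch code: SingleItemReader, ChunkWriter, readerOver and runStep are hypothetical names introduced only for illustration, and the real framework additionally handles transactions, restarts and error handling.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Plain-Java model of a chunk-oriented step. The interfaces below are
// hypothetical stand-ins, NOT Spring Batch's ItemReader/ItemWriter.
final class ChunkLoopSketch {

    /** Returns one item per call, or null when the input is exhausted. */
    interface SingleItemReader<T> {
        T read();
    }

    /** Receives a whole chunk of items at once. */
    interface ChunkWriter<T> {
        void write(List<T> chunk);
    }

    /** Wraps a query result so that each read() hands out a single element. */
    static <T> SingleItemReader<T> readerOver(List<T> results) {
        Iterator<T> iterator = results.iterator();
        return () -> iterator.hasNext() ? iterator.next() : null;
    }

    /** Reads item by item, writes in chunks; returns the number of chunks written. */
    static <T> int runStep(SingleItemReader<T> reader, ChunkWriter<T> writer, int chunkSize) {
        int chunksWritten = 0;
        List<T> chunk = new ArrayList<>();
        T item;
        // The step keeps calling read() until the reader signals the end with null.
        while ((item = reader.read()) != null) {
            chunk.add(item);
            if (chunk.size() == chunkSize) {
                writer.write(new ArrayList<>(chunk));
                chunk.clear();
                chunksWritten++;
            }
        }
        if (!chunk.isEmpty()) { // flush the last, partially filled chunk
            writer.write(new ArrayList<>(chunk));
            chunksWritten++;
        }
        return chunksWritten;
    }

    public static void main(String[] args) {
        List<List<String>> written = new ArrayList<>();
        int chunks = runStep(readerOver(List.of("a", "b", "c", "d", "e")), written::add, 2);
        System.out.println(chunks + " chunks: " + written);
    }
}
```

The repeated calls to read() are what looks like "looping" from the outside: the step only stops asking once the reader returns null. A reader that returns a whole List on every call never reaches that state unless the underlying query result becomes empty.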

To use Tasklet or Chunk in this scenario

I have a job that reads the sub-directories of a given folder. The path is dynamic; we get it from a controller. Currently I use tasklets. There are three of them: one reads the sub-directories, another processes them to prepare the objects to save, and the last one writes the processed objects to a database.
The folder can have any number of sub-folders. Currently, I use this code:
Path start = Paths.get("x:\\data\\");
List<String> collect;
try (Stream<Path> stream = Files.walk(start, 1)) { // close the stream to release directory handles
    collect = stream
            .map(String::valueOf)
            .sorted()
            .collect(Collectors.toList());
}
to read all the sub-folders at once.
I followed this example of a Tasklet implementation: https://www.baeldung.com/spring-batch-tasklet-chunk. Is this the right approach? I also need to run the job asynchronously with multi-threading.
Since there can be a huge number of sub-folders, there can be a huge number of rows or lists of data to process and write to the database.
Please suggest an appropriate approach.
I am learning Spring Batch and have done a few file read/process/write examples too, using the chunk approach.
But my job is to read the sub-directories of a folder, so I cannot decide which approach to follow.

I have a similar scenario: I need to read all the files from a folder, process them, and write them to the database.
@Configuration
@EnableBatchProcessing
public class BatchConfig {
    @Bean
    public Job job(JobBuilderFactory jobBuilderFactory,
                   Step masterStep) {
        return jobBuilderFactory.get("MainJob")
                .incrementer(new RunIdIncrementer())
                .flow(masterStep)
                .end()
                .build();
    }
    @Bean
    public Step mainStep(StepBuilderFactory stepBuilderFactory,
                         JdbcBatchItemWriter<Transaction> writer,
                         ItemReader<String> reader,
                         TransactionItemProcessor processor) {
        return stepBuilderFactory.get("Main")
                .<String, Transaction>chunk(2)
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .taskExecutor(jobTaskExecutor()) // runs the chunks on multiple threads
                .listener(new ItemReaderListener())
                .build();
    }
    @Bean
    public TaskExecutor jobTaskExecutor() {
        ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
        taskExecutor.setCorePoolSize(2);
        taskExecutor.setMaxPoolSize(10);
        taskExecutor.afterPropertiesSet();
        return taskExecutor;
    }
    @Bean
    @StepScope
    public ItemReader<String> reader(@Value("#{stepExecution}") StepExecution stepExecution) throws IOException {
        Path start = Paths.get("D:\\test");
        // Files.walk holds open directory handles, so close the stream after collecting
        try (Stream<Path> paths = Files.walk(start, 1)) {
            List<String> inputFile = paths
                    .map(String::valueOf)
                    .sorted()
                    .collect(Collectors.toList());
            return new IteratorItemReader<>(inputFile);
        }
    }
    @Bean
    @StepScope
    public TransactionItemProcessor processor(@Value("#{stepExecution}") StepExecution stepExecution) {
        return new TransactionItemProcessor();
    }
    @Bean
    @StepScope
    public JdbcBatchItemWriter<Transaction> writer(DataSource dataSource) {
        return new JdbcBatchItemWriterBuilder<Transaction>()
                .itemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider<>())
                .sql("INSERT INTO transaction (id, date, type) VALUES (:id, :date, :type)")
                .dataSource(dataSource)
                .build();
    }
}

Spring Batch memory leak - CSV to database using JpaItemWriter

I had a problem with a Spring Batch job for reading a large CSV file (a few million records) and saving the records from it to a database. The job uses FlatFileItemReader for reading the CSV and JpaItemWriter for writing read and processed records to the database. The problem is that JpaItemWriter doesn't clear the persistence context after flushing another chunk of items to the database and the job ends up with OutOfMemoryError.
I have solved the problem by extending JpaItemWriter and overriding the write method so that it calls EntityManager.clear() after writing a bunch, but I was wondering whether Spring Batch addresses this issue already and the root of the problem is in the job config. How to address this issue the right way?
My solution:
class ClearingJpaItemWriter<T> extends JpaItemWriter<T> {
    private EntityManagerFactory entityManagerFactory;
    @Override
    public void write(List<? extends T> items) {
        super.write(items);
        EntityManager entityManager = EntityManagerFactoryUtils.getTransactionalEntityManager(entityManagerFactory);
        if (entityManager == null) {
            throw new DataAccessResourceFailureException("Unable to obtain a transactional EntityManager");
        }
        entityManager.clear();
    }
    @Override
    public void setEntityManagerFactory(EntityManagerFactory entityManagerFactory) {
        super.setEntityManagerFactory(entityManagerFactory);
        this.entityManagerFactory = entityManagerFactory;
    }
}
You can see the added entityManager.clear(); in the write method.
Job config:
@Bean
public JpaItemWriter postgresWriter() {
    JpaItemWriter writer = new ClearingJpaItemWriter();
    writer.setEntityManagerFactory(pgEntityManagerFactory);
    return writer;
}
@Bean
public Step appontmentInitStep(JpaItemWriter<Appointment> writer, FlatFileItemReader<Appointment> reader) {
    return stepBuilderFactory.get("initEclinicAppointments")
            .transactionManager(platformTransactionManager)
            .<Appointment, Appointment>chunk(5000)
            .reader(reader)
            .writer(writer)
            .faultTolerant()
            .skipLimit(1000)
            .skip(FlatFileParseException.class)
            .build();
}
@Bean
public Job appointmentInitJob(@Qualifier("initEclinicAppointments") Step step) {
    return jobBuilderFactory.get(JOB_NAME)
            .incrementer(new RunIdIncrementer())
            .preventRestart()
            .start(step)
            .build();
}

That's a valid point. The JpaItemWriter (and HibernateItemWriter) used to clear the persistence context, but this was removed in BATCH-1635. It was later re-added and made configurable in the HibernateItemWriter through the clearSession parameter in BATCH-1759, but not in the JpaItemWriter.
So I suggest opening an issue against Spring Batch to add the same option to the JpaItemWriter, so that the persistence context can be cleared after writing items (this would be consistent with the HibernateItemWriter).
That said, to answer your question, you can indeed use a custom writer to clear the persistence context as you did.
Hope this helps.

Spring batch JPAItemReader performance Issue

Below is the configuration of my Spring Batch job, which reads records from the DB, does some processing in the item processor, updates a status column, and writes the records back to the DB.
When I ran it for 10k records, I could see that it reads the records one by one and updates the status the same way. Initially I was planning to use multithreading, but that doesn't make sense because my job runs once a day with the number of records ranging from 10 to 100k. (On most days there are fewer than 5k records; on a very few days a year, 5 to 10 days, it reaches 50k to 100k.)
I don't want to add more CPUs and get charged by Kubernetes for just 10 days a year. The problem is that when I run this job, even though it should take 100 records at a time, it runs every select query independently. The updates are also issued one record at a time, and it takes 10 minutes to process 10k records, which is really slow.
How can I read, process, and write faster? I can get rid of multithreading and accept a bit more CPU utilization once in a while. More information is given as comments in the code.
@Configuration
@EnableBatchProcessing
public class BatchConfiguration extends DefaultBatchConfigurer {
    public final static Logger logger = LoggerFactory.getLogger(BatchConfiguration.class);
    @Autowired
    JobBuilderFactory jobBuilderFactory;
    @Autowired
    StepBuilderFactory stepBuilderFactory;
    @Autowired
    MyRepository myRepository;
    @Autowired
    private EntityManagerFactory entityManagerFactory;
    @Value("${chunk-size}")
    private int chunkSize;
    @Value("${max-threads}")
    private int maxThreads;
    private final DataSource dataSource;
    /**
     * @param dataSource
     * Overridden so that no datasource is set even if one exists during initialization.
     * Initialization will then use a Map-based JobRepository (instead of a database) for the Spring Batch meta tables.
     */
    @Override
    public void setDataSource(DataSource dataSource) {
    }
    @Override
    public PlatformTransactionManager getTransactionManager() {
        return jpaTransactionManager();
    }
    @Autowired
    public BatchConfiguration(@Qualifier("dataSource") DataSource dataSource) {
        this.dataSource = dataSource;
    }
    @Bean
    public JpaTransactionManager jpaTransactionManager() {
        final JpaTransactionManager transactionManager = new JpaTransactionManager();
        transactionManager.setDataSource(dataSource);
        return transactionManager;
    }
    @Bean
    @StepScope
    public JdbcPagingItemReader<ModelEntity> importReader() { // I tried using RepositoryItemReader but records were skipped by JPA, hence I went for JdbcPagingItemReader
        JdbcPagingItemReader<ModelEntity> reader = new JdbcPagingItemReader<ModelEntity>();
        final SqlPagingQueryProviderFactoryBean sqlPagingQueryProviderFactoryBean = new SqlPagingQueryProviderFactoryBean();
        sqlPagingQueryProviderFactoryBean.setDataSource(dataSource);
        sqlPagingQueryProviderFactoryBean.setSelectClause("SELECT *");
        sqlPagingQueryProviderFactoryBean.setFromClause("FROM mytable");
        sqlPagingQueryProviderFactoryBean.setWhereClause("WHERE STATUS = 'myvalue' ");
        sqlPagingQueryProviderFactoryBean.setSortKey("primarykey");
        try {
            reader.setQueryProvider(sqlPagingQueryProviderFactoryBean.getObject());
        } catch (Exception e) {
            e.printStackTrace();
        }
        reader.setDataSource(dataSource);
        reader.setPageSize(chunkSize);
        reader.setSaveState(Boolean.FALSE);
        reader.setRowMapper(new BeanPropertyRowMapper<ModelEntity>(ModelEntity.class));
        return reader;
    }
    @Bean
    public ItemWriter<ModelEntity> databaseWriter() {
        RepositoryItemWriter<ModelEntity> repositoryItemWriter = new RepositoryItemWriter<>();
        repositoryItemWriter.setRepository(myRepository);
        repositoryItemWriter.setMethodName("save");
        return repositoryItemWriter;
    }
    @Bean
    public Myprocessor myprocessor() {
        return new Myprocessor();
    }
    @Bean
    public JobExecutionListener jobExecutionListener() {
        return new JobExecutionListener();
    }
    @Bean
    public StepExecutionListener stepExecutionListener() {
        return new StepExecutionListener();
    }
    @Bean
    public ChunkExecutionListener chunkListener() {
        return new ChunkExecutionListener();
    }
    @Bean
    public TaskExecutor taskExecutor() {
        SimpleAsyncTaskExecutor taskExecutor = new SimpleAsyncTaskExecutor();
        taskExecutor.setConcurrencyLimit(maxThreads);
        return taskExecutor;
    }
    @Bean
    public Job processJob() {
        return jobBuilderFactory.get("myjob")
                .incrementer(new RunIdIncrementer())
                .start(processStep())
                .listener(jobExecutionListener())
                .build();
    }
    @Bean
    public Step processStep() {
        return stepBuilderFactory.get("processStep")
                .<ModelEntity, ModelEntity>chunk(chunkSize)
                .reader(importReader())
                .processor(myprocessor())
                .writer(databaseWriter())
                .taskExecutor(taskExecutor())
                .listener(stepExecutionListener())
                .listener(chunkListener())
                .transactionManager(getTransactionManager())
                .throttleLimit(maxThreads)
                .build();
    }
}
The repository I am using is a JpaRepository, shown below (assuming the save method inherited from its parent CrudRepository does the save):
public interface MyRepository extends JpaRepository<ModelEntity, BigInteger> {
}
The processor is as below:
@Component
public class Myprocessor implements ItemProcessor<ModelEntity, ModelEntity> {
    @Override
    public ModelEntity process(ModelEntity modelEntity) throws Exception {
        try {
            // This is fast and working fine
            if (myProcessing(modelEntity)) {
                modelEntity.setStatus(success);
            } else {
                modelEntity.setStatus(failed);
            }
        }
        catch (Exception e) {
            logger.info("Exception occurred while processing" + e);
        }
        return modelEntity;
    }
    // This is fast and working fine
    public Boolean myProcessing(ModelEntity modelEntity) {
        // Processor logic here
        return processingStatus;
    }
}
Properties file below
logging.level.org.hibernate.SQL=DEBUG
logging.level.com.zaxxer.hikari.HikariConfig=DEBUG
logging.level.org.hibernate.type.descriptor.sql.BasicBinder=TRACE
logging.level.org.springframework.jdbc.core.JdbcTemplate=DEBUG
logging.level.org.springframework.jdbc.core.StatementCreatorUtils=TRACE
spring.datasource.type=com.zaxxer.hikari.HikariDataSource
spring.datasource.url=url
spring.datasource.username=username
spring.datasource.password=password
spring.jpa.hibernate.connection.provider_class=org.hibernate.hikaricp.internal.HikariCPConnectionProvider
spring.jpa.database-platform=org.hibernate.dialect.Oracle10gDialect
spring.jpa.show-sql=false
spring.main.allow-bean-definition-overriding=true
spring.batch.initializer.enabled=false
spring.batch.job.enabled=false
spring.batch.initialize-schema=never
chunk-size=100
max-threads=5

You can enable JDBC batch processing for INSERT, UPDATE and DELETE statements with just one configuration property:
spring.jpa.properties.hibernate.jdbc.batch_size
It determines the number of updates that are sent to the database at one time for execution.
For details, see this link
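As a hedged illustration of the property named above, the fragment below could be added to the application properties file. The value 100 is an assumption chosen to match the chunk-size in the question, and the two ordering properties are optional Hibernate settings that the answer does not mention; they group statements for the same entity so that more of them can be batched together.

```properties
# Send up to 100 statements per JDBC batch (100 is an assumed value, tune as needed)
spring.jpa.properties.hibernate.jdbc.batch_size=100
# Optional (assumption, not from the answer): order statements so more of them batch together
spring.jpa.properties.hibernate.order_inserts=true
spring.jpa.properties.hibernate.order_updates=true
```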

Thank you all for the suggestions. I found the issue myself. I was using JdbcPagingItemReader and RepositoryItemWriter. The reader was working as expected, but the writer was triggering a select query for each record passed to it after the processor. I believe the reason is that a record only becomes managed by JPA after the processor, since the reader is not a standard JPA reader. I am not sure about this, though. Changing the writer to JdbcBatchItemWriter fixed the issue.

Spring Batch - Processor not running after initial run

I am using Spring Boot 2.0.5.RELEASE and running a batch process using this:
# prevent auto-start of batch jobs
spring:
  batch:
    job:
      enabled: false
and triggering it manually through a controller endpoint (input holds the parameters collected from the user by the controller):
jobLauncher.run(job, new JobParametersBuilder()
.addDate("date", new Date())
.addJobParameters(new JobParameters(input)).toJobParameters());
Here is my batch configuration:
@Bean
public MongoItemReader<Document> reader() {
    MongoItemReader<Document> reader = new MongoItemReader<>();
    reader.setTemplate(mongoTemplate);
    reader.setCollection(XML_PERSIST_COLLECTION);
    reader.setQuery("{}");
    Map<String, Sort.Direction> sorts = new HashMap<>(1);
    sorts.put("status", Sort.Direction.ASC);
    reader.setSort(sorts);
    reader.setTargetType(Document.class);
    return reader;
}
@Bean
@StepScope
public MyItemProcessor processor() {
    return new MyItemProcessor();
}
@Bean
public MongoItemWriter<OutputDto> writer() {
    MongoItemWriter<OutputDto> writer = new MongoItemWriter<>();
    writer.setTemplate(mongoTemplate);
    writer.setCollection(RESPONSE_COLLECTION);
    return writer;
}
@Bean
public Step step() {
    return stepBuilderFactory.get("step")
            .<Document, OutputDto>chunk(1)
            .reader(reader())
            .processor(processor())
            .writer(writer())
            .allowStartIfComplete(true)
            .build();
}
@Bean
public Job job(Step step) {
    return jobBuilderFactory.get("job")
            .incrementer(new RunIdIncrementer())
            .flow(step)
            .end()
            .build();
}
and my processor:
public class MyItemProcessor implements ItemProcessor<Document, OutputDto> {
    @Value("#{jobParameters['username']}")
    private String username;
    @Value("#{jobParameters['password']}")
    private String password;
    @Override
    public OutputDto process(final Document document) throws Exception {
        // implementation code
    }
}
I am using @StepScope for the processor to extract the job parameters that are passed from my controller.
Issue:
Everything is fine except that the batch job will run only once after the app starts and it will not run again (it runs, but I tried keeping debug points in processor and it is not getting there). I am already adding a timestamp job parameter so that the batch job can be run again, yet the processor is not running more than once (when it should). Any ideas?

The reader() and writer() had singleton scope while the processor() had @StepScope, so it looks like that is why the writer() was not getting invoked.
I added @StepScope to the reader and writer and now everything is working fine, though it didn't strike me as intuitive; it should have worked without that in 2.0.5.RELEASE.
