I have a batch job that reads from an Oracle database with a JpaPagingItemReader and writes to MariaDB with a JpaItemWriter. Is there a way to do a bulk insert into MariaDB, or a bulk execute the way MongoDB has the BulkOperations.execute() method?
I've used the provided JpaItemWriter as follows:
@Bean
@Transactional
public JpaItemWriter<entity.maria.class> classJpaItemWriter() {
    JpaItemWriter<entity.maria.class> writer = new JpaItemWriter<>();
    writer.setEntityManagerFactory(mariaEntityManager.getObject());
    return writer;
}
The reader is:
public JpaPagingItemReader<PojoClass> classJpaReader() throws Exception {
    String jpqlQuery = "SELECT t FROM PojoClass t where rownum < 15001";
    JpaPagingItemReader<PojoClass> reader = new JpaPagingItemReader<>();
    reader.setQueryString(jpqlQuery);
    reader.setEntityManagerFactory(oracleEntityManager.getObject());
    reader.setPageSize(100000);
    reader.afterPropertiesSet();
    reader.setSaveState(true);
    return reader;
}
The step configuration is:
@Bean
public Step classStep() throws Exception {
    Step auditStep = stepBuilderFactory.get("entity.oracle.PojoClass")
            .<class, entity.maria.class>chunk(10000)
            .reader(classJpaReader())
            .writer(classJpaItemWriter())
            .transactionManager(mariaTransactionManager)
            .listener(auditWriterListener())
            .faultTolerant()
            .skipLimit(10000)
            .skip(Exception.class)
            .build();
    return auditStep;
}
I would like to write a custom writer that bulk-inserts values into MariaDB, so that the insert/upsert time decreases. Currently, inserting 15,000 rows takes 326 seconds, which seems lengthy.
Any suggestions?
You can try the JdbcBatchItemWriter, which uses JdbcTemplate#batchUpdate behind the scenes. This is usually faster than the JPA item writers because it does not interact with a persistence context, a first/second-level cache, etc.
Hope this helps.
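To illustrate why this tends to be faster: a writer of this kind ultimately relies on plain JDBC batching, where rows are queued on one prepared statement and sent together instead of being executed one by one. The following is a minimal sketch of that mechanism using only java.sql. It is not the Spring Batch implementation; the class name JdbcBatchSketch, the method insertInBatches, and the batch size are assumptions made for illustration.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class JdbcBatchSketch {

    /**
     * Inserts all rows using JDBC batching and returns the number of
     * queued statements handed to the driver.
     * Hypothetical helper for illustration, not part of Spring Batch.
     */
    public static int insertInBatches(Connection con, String sql,
                                      List<Object[]> rows, int batchSize) throws SQLException {
        int sent = 0;
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            int pending = 0;
            for (Object[] row : rows) {
                for (int i = 0; i < row.length; i++) {
                    ps.setObject(i + 1, row[i]); // JDBC parameters are 1-based
                }
                ps.addBatch();                   // queue the row instead of executing it
                if (++pending == batchSize) {
                    sent += ps.executeBatch().length; // send the queued rows together
                    pending = 0;
                }
            }
            if (pending > 0) {
                sent += ps.executeBatch().length;     // send the remainder
            }
        }
        return sent;
    }
}
```

How much this helps depends on the driver and its settings, so it is worth measuring with your own data before and after the change.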
I have created a generic Spring Batch job for processing data and storing it in a CSV file. I need some data from the reader passed to the writer, which I am trying to do using the JobExecution. However, surprisingly, the code seems to call getWriter() before getReader().
My config is given below. Could someone explain why this is happening, and whether there is an alternative way to pass data from the reader to the writer?
@Bean
@StepScope
public ItemReader<Map<String, Object>> getDataReader() throws Exception {
    return springBatchReader.getReader();
}
@Bean
@StepScope
public FlatFileItemWriter<Map<String, Object>> getDataWriter() throws Exception {
    return (FlatFileItemWriter<Map<String, Object>>) springBatchWriter.getWriter();
}
@Bean
public Job SpringBatchJob(Step generateReport) throws Exception {
    return jobBuilderFactory.get("SpringBatchJob" + System.currentTimeMillis())
            .preventRestart()
            .incrementer(new RunIdIncrementer())
            .flow(generateReport)
            .end()
            .build();
}
@Bean
public Step generateReport() throws Exception {
    return stepBuilderFactory.get("generateReport")
            .<Map<String, Object>, Map<String, Object>>chunk(batchSize)
            .reader(getDataReader())
            .writer(getDataWriter())
            .build();
}
The data I want to pass from the reader to the writer is the column names for the CSV. My reader runs variable SQL queries (the SQL query to run is passed as a command-line argument), so the result set and its columns are not static and vary with the given query. The rationale behind sending data from the reader to the writer was to give the writer, via setHeaderCallback, the column names to write for that particular execution.
The reader simply runs the given query and puts the data into a Map<String, Object> rather than a POJO, because of the variable nature of the data. The keys of the Map are the column names, and the corresponding objects hold the values for those columns. So essentially I want the writer's setHeaderCallback to be able to access the keys of the passed Map, or to pass the keys from the reader to the writer somehow.
The writer code is as follows:
public FlatFileItemWriter<Map<String, Object>> getWriter() throws Exception {
    String reportName = getReportName();
    saveToContext(reportName, reportPath);
    FileSystemResource resource = new FileSystemResource(String.join(File.separator, reportPath, getReportName()));
    FlatFileItemWriter<Map<String, Object>> flatFileItemWriter = new FlatFileItemWriter<>();
    flatFileItemWriter.setResource(resource);
    //NEED HELP HERE..HOW TO SET THE HEADER TO BE THE KEYS OF THE MAP
    //flatFileItemWriter.setHeaderCallback();
    flatFileItemWriter.setLineAggregator(new DelimitedLineAggregator<Map<String, Object>>() {
        {
            setDelimiter(delimiter);
            setFieldExtractor(
                new PassThroughFieldExtractor<>()
            );
        }
    });
    flatFileItemWriter.afterPropertiesSet();
    return flatFileItemWriter;
}
The execution order of those methods does not matter. You should not be looking for a way to pass data from the reader to the writer using the execution context: the chunk-oriented tasklet implementation provided by Spring Batch does that for you.
The execution context can be used to pass data from one step to another, but not from the reader to the writer within the same step.
EDIT: update answer based on comments:
Your issue is that you are calling saveToContext(reportName, reportPath); in the getWriter method. This method is called at configuration time and not at runtime.
What you really need is to provide the column names, either via job parameters or by putting them in the execution context in an earlier step, and then use a step-scoped header callback that is configured with those headers.
You can find an example here: https://stackoverflow.com/a/56719077/5019386. This example is for the lineMapper, but you can do the same for the headerCallback. If you don't want to use the job parameters approach, you can create a tasklet step that determines the column names and puts them in the execution context, then configure the step-scoped header callback with those names from the execution context, something like:
@Bean
@StepScope
public FlatFileHeaderCallback headerCallback(@Value("#{jobExecutionContext['columnNames']}") String columnNames) {
    return new FlatFileHeaderCallback() {
        @Override
        public void writeHeader(Writer writer) throws IOException {
            // use columnNames here
        }
    };
}
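As a possible body for the "use columnNames here" part, here is a minimal sketch in plain Java. It assumes the column names were stored in the execution context as a single comma-separated String (matching the String columnNames parameter above). The class CsvHeaderSketch and both method names are hypothetical; columnNamesOf shows one way the earlier step could derive that String from the keys of a row.

```java
import java.io.IOException;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CsvHeaderSketch {

    /** Possible body for writeHeader(Writer): writes the column names joined by the delimiter. */
    public static void writeHeader(Writer writer, String columnNames, String delimiter) throws IOException {
        // Assumption: columnNames is a comma-separated list such as "id,name,city"
        List<String> names = Arrays.stream(columnNames.split(","))
                .map(String::trim)
                .collect(Collectors.toList());
        writer.write(String.join(delimiter, names));
    }

    /** Derives the comma-separated column names from the keys of one row. */
    public static String columnNamesOf(Map<String, Object> row) {
        // Key order follows the Map implementation; a LinkedHashMap keeps insertion order
        return String.join(",", row.keySet());
    }
}
```

The header order only matches the data columns if the rows use a Map with a stable key order, so that is worth checking in the reader.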
I hope you're doing great.
I'm facing a design problem in Spring Batch.
Let me explain:
I have a modular Spring Batch job architecture;
each job has its own config file and context.
I am designing a master job to launch the subjobs (50+ types of subjobs).
Each X obj has, among other fields, a name, a state, and a blob that contains the CSV file attached to it.
The X obj is updated after being processed.
My first approach was to fetch all X objs and then loop over them (with a Java stream), calling the appropriate job for each.
But this approach has a lot of limitations.
So I designed a masterJob with a reader, a processor, and a writer.
The masterJob should read an X obj, call the appropriate subJob, and then update the state of the X obj.
The masterJobReader calls a custom service to get a list of X objs.
I started by trying to launch the subjob from within the masterJob processor, but it did not work.
I did some research and found that JobStep could be better suited to this scenario.
But I'm stuck on how to pass the item read by masterJobReader to the JobStep as a parameter.
I saw DefaultJobParametersExtractor and tried to put the item read into the stepExecutionContext, but it's not working.
My question: how do I pass parameters from the MasterJob to the SubJob using
the JobStep approach?
If there is a better way to deal with this, then I'm all yours!
I'm using Java config and Spring Batch 4.3.
Edit to provide sample code:
@Configuration
public class MasterJob {
    @Value("${defaultCompletionPolicy}")
    private Integer defaultCompletionPolicy;
    @Autowired
    protected StepBuilderFactory masterStepBuilderFactory;
    private Logger logger = LoggerFactory.getLogger(MasterJob.class);
    @Autowired
    protected JobRepository jobRepo;
    @Autowired
    protected PlatformTransactionManager transactionManager;
    @Autowired
    @Qualifier("JOB_NAME1")
    private Job JOB_NAME1; // this should change to be dynamic as there are around 50 types of job

    @Bean(name = "masterJob")
    protected Job masterBatchJob() throws ApiException {
        return new JobBuilderFactory(jobRepo).get("masterJob")
                .incrementer(new RunIdIncrementer())
                .start(masterJobStep(masterJobReader(), masterJobWriter()))
                .next(jobStepJobStep1(null))
                .next(masterUpdateStep()) // update the state of objX
                .build();
    }

    @Bean(name = "masterJobStep")
    protected Step masterJobStep(@Qualifier("masterJobReader") MasterJobReader masterReader,
            @Qualifier("masterJobWriter") MasterJobWriter masterWriter) throws ApiException {
        logger.debug("inside masterJobStep");
        return this.masterStepBuilderFactory.get("masterJobStep")
                .<Customer, Customer>chunk(defaultCompletionPolicy)
                .reader(masterJobReader())
                .processor(masterJobProcessor())
                .writer(masterJobWriter())
                .transactionManager(transactionManager)
                .listener(new MasterJobWriter()) // I set the parameter inside this.
                .listener(masterPromotionListener())
                .build();
    }

    @Bean(name = "masterJobWriter", destroyMethod = "")
    @StepScope
    protected MasterJobWriter masterJobWriter() {
        return new MasterJobWriter();
    }

    @Bean(name = "masterJobReader", destroyMethod = "")
    @StepScope
    protected MasterJobReader masterJobReader() throws ApiException {
        return new MasterJobReader();
    }

    protected FieldSetMapper<Customer> mapper() {
        return new CustomerMapper();
    }

    @Bean(name = "masterPromotionListener")
    public ExecutionContextPromotionListener masterPromotionListener() {
        ExecutionContextPromotionListener listener = new ExecutionContextPromotionListener();
        listener.setKeys(
                new String[] {
                    "inputFile",
                    "outputFile",
                    "customerId",
                    "comments",
                    "customer"
                });
        //listener.setStrict(true);
        return listener;
    }

    @Bean(name = "masterUpdateStep")
    public Step masterUpdateStep() {
        return this.masterStepBuilderFactory.get("masterCleanStep").tasklet(new MasterUpdateTasklet()).build();
    }

    @Bean(name = "masterJobProcessor", destroyMethod = "")
    @StepScope
    protected MasterJobProcessor masterJobProcessor() {
        return new MasterJobProcessor();
    }

    @Bean
    public Step jobStepJobStep1(JobLauncher jobLauncher) {
        return this.masterStepBuilderFactory.get("jobStepJobStep1")
                .job(JOB_NAME1)
                .launcher(jobLauncher)
                .parametersExtractor(jobParametersExtractor())
                .build();
    }

    @Bean
    public DefaultJobParametersExtractor jobParametersExtractor() {
        DefaultJobParametersExtractor extractor = new DefaultJobParametersExtractor();
        extractor.setKeys(
                new String[] { "inputFile", "outputFile", "customerId", "comments", "customer" });
        return extractor;
    }
}
This is how I set the parameters from within the MasterJobWriter:
String inputFile = fetchInputFile(customer);
String outputFile = buildOutputFileName(customer);
Comments comments = "comments"; // from business logic
ExecutionContext stepContext = this.stepExecution.getExecutionContext();
stepContext.put("inputFile", inputFile);
stepContext.put("outputFile", outputFile);
stepContext.put("customerId", customer.getCustomerId());
stepContext.put("comments", new CustomJobParameter<Comments>(comments));
stepContext.put("customer", new CustomJobParameter<Customer>(customer));
I followed this section of the Spring Batch documentation.
My question: how do I pass parameters from the MasterJob to the SubJob using the JobStep approach?
The JobParametersExtractor is what you are looking for. It allows you to extract parameters from the main job and pass them to the subjob. You can find an example here.
EDIT: Adding suggestions based on comments
I have a list of X objs in the DB. Each X obj has, among other fields, an id, a type (of work), a name, a state, and a blob that contains the CSV file attached to it. The content of the blob field depends on the type field, so the CSV files do not all follow one pattern. I need to process each X obj, save the content of the CSV file in the DB, generate a result CSV file containing the original data plus a comment field, and update the X obj's state, attaching the result CSV file to the X obj along with other fields.
As you can see, the process is already complex for a single X object. So trying to process all X objects in the same job of jobs is too complex IMHO. So much complexity in software comes from trying to make one thing do two things.
If there is better way to deal with this then I'm all yours!
Since you are open for suggestions, I will recommend two options:
Option 1:
If it were up to me, I would create a job instance per X obj. This way, I can 1) parallelize things and 2) in case of failure, restart only the failed job. These two characteristics (Scalability and Restartability) are almost impossible with the job of jobs approach. Even if you have a lot of X objects, this is not a problem. You can use one of the scaling techniques provided by Spring Batch to process things in parallel.
Option 2:
If you really can't or don't want to use different job instances, you can use a single job with a chunk-oriented step that iterates over X objects list. The processing logic seems independent from one record to another, so this step should be easily scalable with multiple threads.
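The idea behind Option 1 (one independent unit of work per X obj, run in parallel, with failures isolated so that only the failed ones are rerun) can be sketched in plain Java. This is only a stand-in for the design, not Spring Batch code: in a real setup each task would launch a job with parameters identifying the X obj. The class PerObjectRunSketch and the method runAll are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

public class PerObjectRunSketch {

    /** Runs one independent task per id and returns the ids whose task failed. */
    public static List<String> runAll(List<String> ids, Consumer<String> work, int threads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // One submitted task per X obj, so a failure affects only that object
            Map<String, Future<?>> running = new LinkedHashMap<>();
            for (String id : ids) {
                running.put(id, pool.submit(() -> work.accept(id)));
            }
            List<String> failed = new ArrayList<>();
            for (Map.Entry<String, Future<?>> entry : running.entrySet()) {
                try {
                    entry.getValue().get(); // wait for this task to finish
                } catch (ExecutionException ex) {
                    failed.add(entry.getKey()); // candidate for a later rerun
                }
            }
            return failed;
        } finally {
            pool.shutdown();
        }
    }
}
```

The returned list plays the role of "restart only the failed job": everything that succeeded is left alone.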
Hi, I would like to stream a very large table with spring-data-jdbc. For this purpose
I have set my connection to READ_ONLY and declared a method in my repository that looks like this:
interface PackageRepository extends Repository<Package, String> {
    Stream<Package> findAll();
}
My expectation here would be that the result set would be of type FORWARD_ONLY and that this method would not block indefinitely until all results are received from the database.
Here I would make a comparison with Spring Data JPA, where the Stream methods are not blocking and the content of the database is fetched in portions depending on the fetch size.
Have I missed some configuration? How can I achieve this behaviour with spring-data-jdbc?
UPDATE: I will put the question in a different form. How can I achieve with spring-data-jdbc the equivalent of:
template.query(new PreparedStatementCreator() {
    @Override
    public PreparedStatement createPreparedStatement(Connection con) throws SQLException {
        PreparedStatement statement = con.prepareStatement("select * from MYTABLE with UR", ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        statement.setFetchSize(150000);
        return statement;
    }
}, new RowCallbackHandler() {
    @Override
    public void processRow(ResultSet rs) throws SQLException {
        // do my processing here
    }
});
Just by adding setFetchSize(Integer.MIN_VALUE) before querying, queryForStream indeed gives us a stream that loads records one by one rather than eagerly loading all records into memory in one shot.
namedTemplate.getJdbcTemplate().setFetchSize(Integer.MIN_VALUE);
Stream<LargeEntity> entities = namedTemplate.queryForStream(sql, params, rowMapper);
dependencies:
spring framework 5.3+
mysql-connector-java 8.0.x (or mariadb-java-client 2.7.x)
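One caution worth adding: a stream backed by an open database cursor holds resources until it is closed, so it is generally consumed inside try-with-resources. The sketch below mimics this with the JDK only. LazyRowStream and its counters are hypothetical stand-ins for a cursor-backed stream; no Spring or JDBC is involved.

```java
import java.util.Iterator;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class LazyRowStream {
    public final AtomicInteger fetched = new AtomicInteger();  // rows actually pulled from the "cursor"
    public final AtomicBoolean closed = new AtomicBoolean();   // set when the stream is closed

    /** Returns a lazy stream over `total` fake rows; rows are produced only on demand. */
    public Stream<Integer> rows(final int total) {
        Iterator<Integer> cursor = new Iterator<Integer>() {
            private int position = 0;
            @Override public boolean hasNext() { return position < total; }
            @Override public Integer next() { fetched.incrementAndGet(); return position++; }
        };
        return StreamSupport
                .stream(Spliterators.spliteratorUnknownSize(cursor, Spliterator.ORDERED), false)
                .onClose(() -> closed.set(true)); // stands in for releasing the cursor/connection
    }

    /** Consumes only the first `limit` rows and closes the stream afterwards. */
    public static int sumFirst(LazyRowStream source, int total, int limit) {
        try (Stream<Integer> stream = source.rows(total)) {
            return stream.limit(limit).mapToInt(Integer::intValue).sum();
        }
    }
}
```

The same try-with-resources shape applies to the stream returned by queryForStream above.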
I have a situation where the user needs to import a file's worth of data into a database. They can select at the start whether to do a standard import and then create a report summary file, or to 'simulate' the import (i.e., create the same report summary, but not actually import anything). The basic setup is below:
public class Importer {
    @Autowired
    protected IConverter converter;
    @Autowired
    protected ISerializer serializer;
    @Autowired
    protected IReporter reporter;

    public void import( InputStream stream ) throws Exception {
        CustomerData data = converter.convert( stream );
        // ** database at this point has been updated! **
        if( getContext().isSerialize() ) {
            serializer.serialize( data );
        }
        if( getContext().isReport() ) {
            reporter.report( data, "report.xls" );
        }
    }
}
public class Converter implements IConverter {
    @Transactional
    public CustomerData convert( InputStream stream ) throws Exception {
        try {
            CustomerData data = ... // read file and create/match with db entities
            return data;
        } finally {
            if( !getContext().isSerialize() ) {
                // clear any changes made to objects made in the db
                getEntityManager().clear();
                // ** database at this point is unaffected **
            }
        }
    }
}
The Importer is a bean class configured in Spring 4.1. Database is JPA 2.1/Hibernate 4.3.11/MySQL 5.5. Using Java 8.
The CustomerData object is a tree of database entity objects, some of which have been matched with data in the database (potentially with properties updated with data from the import file), and others of which are new entities.
isSerialize() and isReport() allows control over whether the database is updated. When simulating the import, isSerialize() = false, isReport() = true.
Stepping through the code, when I enter the finally block and clear the entity manager, the data in the database is as it was before the import. However when I return to the import() method the database has been updated with the changes to the entities!
Clearly the transactional import() method completing commits the data, but why did the clear of the entity manager not stop the changes from happening? To make sure I set a breakpoint on [Hibernate] AbstractEntityManagerImpl.flush() and it's not called at all here.
Could someone please help me understand why clear() doesn't work, and what I should be doing instead?
Thanks, S. Piller for putting me on the right track. For anyone in a similar position, clearing the entity manager won't stop flushed changes being committed to the database at the end of the transaction - clear() will only clear changed entities since the last flush.
The way around it is to notify the transaction manager that the transaction needs to be rolled back:
@Transactional
public CustomerData convert( InputStream stream ) {
    try {
        CustomerData data = ... // read file and create/match with db entities
        return data;
    } finally {
        if( !getContext().isSerialize() ) {
            // Ensure the current transaction is rolled back once the topmost @Transactional method completes.
            TransactionAspectSupport.currentTransactionStatus().setRollbackOnly();
        }
    }
}
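The difference between clearing and rolling back can be made concrete with a deliberately simplified toy model in plain Java. It is not how Hibernate is implemented; ToyTransaction and all of its methods are hypothetical and only mirror the explanation above: clear() discards unflushed changes, while work already flushed is committed at the end unless the transaction is marked rollback-only.

```java
import java.util.ArrayList;
import java.util.List;

public class ToyTransaction {
    private final List<String> database = new ArrayList<>(); // committed rows
    private final List<String> flushed = new ArrayList<>();  // sent to the DB, not yet committed
    private final List<String> pending = new ArrayList<>();  // changes still held in memory
    private boolean rollbackOnly;

    public void change(String row) { pending.add(row); }

    public void flush() { flushed.addAll(pending); pending.clear(); }

    /** Discards only the changes that have not been flushed yet. */
    public void clear() { pending.clear(); }

    public void setRollbackOnly() { rollbackOnly = true; }

    /** Ends the transaction and returns the committed rows. */
    public List<String> complete() {
        if (!rollbackOnly) {
            flush();                   // remaining changes are flushed at commit time
            database.addAll(flushed);  // commit everything that was flushed
        }
        flushed.clear();
        pending.clear();
        rollbackOnly = false;
        return new ArrayList<>(database);
    }
}
```

In this model, change, flush, clear, complete still commits the row, whereas change, flush, setRollbackOnly, complete leaves the database empty.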
From a high level, my application flow looks like this:
A REST controller RequestMapping is triggered by a POST request. The REST controller calls a method in a service class.
@RequestMapping(value="/eventreports", method = RequestMethod.POST, produces = "application/json")
public @ResponseBody List<EventReports> addReportIds(@RequestBody List<Integer> reportIds) {
    List<EventReports> eventReports = railAgentCollectorServiceImpl.addReportIds(reportIds);
    return eventReports;
}
The service method calls a method in a DAO class.
@Override
public List<EventReports> addReportIds(List<Integer> reportIds) {
    List<EventReports> eventReports = eventReportsDAOImpl.listEventReportsInJsonRequest(reportIds);
    return eventReports;
}
The DAO method executes a StoredProcedureQuery against a SQL datasource that returns results as an ArrayList of domain objects. The service class passes this ArrayList of domain objects back to the REST controller, which returns the ArrayList of domain objects as a JSON string.
@Override
public List<EventReports> listEventReportsInJsonRequest(List<Integer> reportIds) {
    ArrayList<EventReports> erArr = new ArrayList<EventReports>();
    try {
        StoredProcedureQuery q = em.createStoredProcedureQuery("sp_get_event_reports", "eventReportsResult");
        q.registerStoredProcedureParameter("reportIds", String.class, ParameterMode.IN);
        q.setParameter("reportIds", reportIdsList);
        boolean isResultSet = q.execute(); //try catch here
        erArr = (ArrayList<EventReports>) q.getResultList();
    } catch (Exception e) {
        System.out.println("No event reports found for list " + reportIdsList);
    }
    return erArr;
}
I've been investigating integrating Spring Batch processing into the above pattern. I've been looking at the Spring getting started guide for batch processing here https://spring.io/guides/gs/batch-processing/, paying particular attention to the source code for BatchConfiguration.java. I'm uncertain whether my application is suited to Spring Batch; maybe my incomplete knowledge of Spring Batch and the various ways it can be implemented is preventing me from conceptualizing it. The BatchConfiguration.java code below suggests to me that Spring Batch may be best suited to iterating through a list of items, reading them one by one, processing them one by one, and writing them one by one, whereas my service code is based on gathering and writing a list of objects all at once.
@Bean
public FlatFileItemReader<Person> reader() {
    FlatFileItemReader<Person> reader = new FlatFileItemReader<Person>();
    reader.setResource(new ClassPathResource("sample-data.csv"));
    reader.setLineMapper(new DefaultLineMapper<Person>() {{
        setLineTokenizer(new DelimitedLineTokenizer() {{
            setNames(new String[] { "firstName", "lastName" });
        }});
        setFieldSetMapper(new BeanWrapperFieldSetMapper<Person>() {{
            setTargetType(Person.class);
        }});
    }});
    return reader;
}
@Bean
public PersonItemProcessor processor() {
    return new PersonItemProcessor();
}
@Bean
public JdbcBatchItemWriter<Person> writer() {
    JdbcBatchItemWriter<Person> writer = new JdbcBatchItemWriter<Person>();
    writer.setItemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider<Person>());
    writer.setSql("INSERT INTO people (first_name, last_name) VALUES (:firstName, :lastName)");
    writer.setDataSource(dataSource);
    return writer;
}
// end::readerwriterprocessor[]
// tag::jobstep[]
@Bean
public Job importUserJob(JobCompletionNotificationListener listener) {
    return jobBuilderFactory.get("importUserJob")
            .incrementer(new RunIdIncrementer())
            .listener(listener)
            .flow(step1())
            .end()
            .build();
}
@Bean
public Step step1() {
    return stepBuilderFactory.get("step1")
            .<Person, Person> chunk(10)
            .reader(reader())
            .processor(processor())
            .writer(writer())
            .build();
}
Is this true? Could I still take advantage of resume-ability, scheduling and synchronization provided by Spring Batch for my existing code? Any suggestions appreciated.
I think the main thing you need to consider is synchronous versus asynchronous behavior. Batch processes are used for long-running tasks, so:
Consider whether your task is long-running or not. If it is, you can use batch. This is going to be asynchronous, because your request comes in, starts the task, and then responds to the user.
The batch then runs to completion and writes a result back to the database. The user will either have to poll for the result using AJAX, or you may have to implement a push-notification mechanism that reports the state of the task and avoids polling.
It's true that a Spring Batch chunk-oriented step consisting of reader -> processor -> writer reads one item and processes one item at a time, but it writes a chunk of items according to the defined chunk size.
So you can send thousands of items to the writer in one go, depending on your defined chunk_size.
Having said that, although a reader hands over one item at a time, it does not have to read only one item at a time from the source (file, DB, etc.). There are readers that read a large number of items from the source in one go, hold them in a list, and hand them over to the processor one by one until the list is exhausted.
One such reader is JdbcPagingItemReader. It reads, for example, a few thousand rows from the database in one go, as per the defined reader page_size (which reduces DB calls significantly), and then keeps handing them over one by one to the processor. The processed outputs are accumulated automatically until chunk_size is reached and are then handed over to the writer in bulk.
It may just be that nothing is ready off the shelf in the API for your requirement; in that case, you will have to write your own ItemReader.
Look at the code of JdbcPagingItemReader to get ideas.
For your situation, the Spring Batch writer doesn't seem to be a problem at all: it already writes in bulk with just a simple configuration. You will have to feed the controller's output to a reader that works along similar lines to JdbcPagingItemReader.
All I want to say is that in-memory processing is done one by one (and that is very fast), but IO can be done in bulk in Spring Batch (if you choose so).
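The read-one, process-one, write-in-bulk flow described above can be sketched as a small loop in plain Java. This is a simplified illustration of the idea rather than Spring Batch's actual implementation (no transactions, skips, or restarts). ChunkLoopSketch and run are hypothetical names, with an Iterator, a Function, and a Consumer standing in for the reader, processor, and writer.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

public class ChunkLoopSketch {

    /**
     * Reads and processes items one at a time, but hands them to the writer
     * one chunk at a time. Returns the number of write calls.
     */
    public static <I, O> int run(Iterator<I> reader, Function<I, O> processor,
                                 Consumer<List<O>> writer, int chunkSize) {
        int writes = 0;
        while (reader.hasNext()) {
            // Read up to chunkSize items, one by one
            List<I> inputs = new ArrayList<>();
            while (inputs.size() < chunkSize && reader.hasNext()) {
                inputs.add(reader.next());
            }
            // Process each item individually
            List<O> outputs = new ArrayList<>();
            for (I input : inputs) {
                outputs.add(processor.apply(input));
            }
            // Write the whole chunk in one call (this is where bulk IO happens)
            writer.accept(outputs);
            writes++;
        }
        return writes;
    }
}
```

With five items and a chunk size of 2, the writer is called three times: twice with two items and once with the remaining one.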
Hope it helps !!