Difference between Spring JDBCTemplate ResultSetExtractor and Spring Batch ItemReader - java

I have a large MySQL database table (more than 1 million records). I need to read all the data and do some processing on it in Java.
I want to make sure the Java process doesn't consume too much memory by loading the entire result set into memory.
While looking at cursor-based implementations, I found a few options:
Using Spring JdbcTemplate with a ResultSetExtractor or RowCallbackHandler and reading rows sequentially.
Using Spring Batch's JdbcCursorItemReader or JdbcPagingItemReader.
Can someone explain the difference between these two options?

Option 1 seems better, with some internal batching on your application side if you require any batching.
JdbcCursorItemReader opens a new connection and hence will not participate in your application transaction. See the API at http://docs.spring.io/spring-batch/apidocs/org/springframework/batch/item/database/AbstractCursorItemReader.html This API is part of Spring Batch, so if you are writing a batch-processing application it will be well suited; see the Spring Batch documentation.
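A minimal sketch of option 1, assuming a records table with id and payload columns (table, column and method names are illustrative). With MySQL Connector/J the fetch size must be Integer.MIN_VALUE for the driver to stream rows one at a time instead of buffering the whole result set:

```java
import javax.sql.DataSource;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.RowCallbackHandler;

public class LargeTableReader {

    private final JdbcTemplate jdbcTemplate;

    public LargeTableReader(DataSource dataSource) {
        this.jdbcTemplate = new JdbcTemplate(dataSource);
        // MySQL Connector/J only streams results row by row when the fetch size is
        // Integer.MIN_VALUE; most other drivers just need a small positive value.
        this.jdbcTemplate.setFetchSize(Integer.MIN_VALUE);
    }

    public void readAll() {
        jdbcTemplate.query(
                "SELECT id, payload FROM records",   // illustrative table/columns
                (RowCallbackHandler) rs -> {
                    // Only the current row is held in memory.
                    long id = rs.getLong("id");
                    String payload = rs.getString("payload");
                    // process(id, payload) would go here
                });
    }
}
```

JdbcCursorItemReader achieves the same row-at-a-time behaviour, but wraps it in Spring Batch's chunk-oriented, restartable infrastructure, which is the main practical difference between the two options.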

Related

Is there a way to record audit data in Spring Boot using Spring Data Envers with Spring JdbcTemplate?

I'm creating a Spring web application that uses a MySQL database with Spring JdbcTemplate. The problem is that I want to record any changes to the data stored in the MySQL database. I couldn't find any solution for Spring Data Envers with JdbcTemplate to record the changes.
What is the best way to record changes to the data in the database? Or should I simply write them to a text file in the Spring app?
Envers, on which Spring Data Envers builds, is an add-on to Hibernate and uses its change detection mechanism to trigger writing revisions to the database.
JdbcTemplate doesn't have any of that; it just eases the execution of SQL statements by abstracting away repetitive tasks like exception handling or iterating over the ResultSet of queries. JdbcTemplate has no knowledge of what the statement it is executing actually does.
As so often, you have a couple of options:
Put triggers on your database that record changes.
Use a database-dependent feature like Oracle's Change Data Capture.
Create a wrapper around JdbcTemplate that analyses the SQL statement and produces a log entry (a sketch follows below). This is only feasible when you need only very limited information, like what kind of statement was executed and which table was affected.
If you need more semantic information, it is probably best to gather the relevant information at an even higher level of your application stack, like the controller or service, and write it to the database, probably using JdbcTemplate as well.
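A minimal sketch of the wrapper idea, assuming a hypothetical audit_log table; only the statement text and row count are captured, which illustrates why this approach is limited:

```java
import javax.sql.DataSource;
import org.springframework.jdbc.core.JdbcTemplate;

// Delegates to JdbcTemplate and writes a coarse-grained audit entry for each update.
public class AuditingJdbcTemplate extends JdbcTemplate {

    public AuditingJdbcTemplate(DataSource dataSource) {
        super(dataSource);
    }

    @Override
    public int update(String sql, Object... args) {
        int rows = super.update(sql, args);
        // Very limited information: only the raw statement text and the affected row count.
        super.update("INSERT INTO audit_log (executed_sql, affected_rows, executed_at) VALUES (?, ?, NOW())",
                sql, rows);
        return rows;
    }
}
```

Only calls going through this particular update overload are audited; covering every execute/update/batchUpdate variant quickly becomes unwieldy, which is why the service-layer approach is usually preferable when richer information is needed.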

Spring Batch Job hangs - concurrent steps and each step using multiple threads

I am using Spring Batch to process records from database tables in the following scenario:
Data from 5 tables is processed concurrently using 5 parallel steps
Each parallel step uses a further 5 threads to process records from a single table
Here is a summary of the job configuration: TestJob -> parallel Step1 & Step2 -> Step 1 using 2 threads, Step 2 using 2 threads
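For illustration, a simplified sketch of what such a split-flow configuration might look like (reader/writer beans, chunk size and names are placeholders, not the original code; step2 would be defined analogously to step1):

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.job.builder.FlowBuilder;
import org.springframework.batch.core.job.flow.Flow;
import org.springframework.batch.core.job.flow.support.SimpleFlow;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
@EnableBatchProcessing
public class TestJobConfig {

    @Bean
    public Job testJob(JobBuilderFactory jobs, Step step1, Step step2) {
        // The two steps run concurrently inside a split flow.
        Flow splitFlow = new FlowBuilder<SimpleFlow>("splitFlow")
                .split(new SimpleAsyncTaskExecutor("split-"))
                .add(new FlowBuilder<SimpleFlow>("flow1").start(step1).build(),
                     new FlowBuilder<SimpleFlow>("flow2").start(step2).build())
                .build();
        return jobs.get("testJob")
                .start(splitFlow)
                .build()   // builds the flow job builder
                .build();  // builds the Job
    }

    @Bean
    public Step step1(StepBuilderFactory steps,
                      ItemReader<String> reader, ItemWriter<String> writer) {
        // Each step also processes its own table with multiple threads.
        return steps.get("step1")
                .<String, String>chunk(100)
                .reader(reader)
                .writer(writer)
                .taskExecutor(new SimpleAsyncTaskExecutor("step1-"))
                .throttleLimit(2)
                .build();
    }
}
```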
For the Spring Batch metadata tables I tried SQL Server and the HSQL in-memory database, but Spring Batch somehow gets stuck when selecting from BATCH_STEP_EXECUTION_SEQ.
Spring Batch tries to INSERT into the BATCH_STEP_EXECUTION table, and it hangs while getting the ID from the BATCH_STEP_EXECUTION_SEQ table.
I am using Spring Boot 2.2.2.RELEASE. I tried overriding the JobRepository configuration with different isolation levels for create, but the problem always persists.
NOTE:
Everything works as expected when:
Processing the tables concurrently, with each table processed by a single thread
Processing a single table at a time, with that table processed by multiple threads
Any help/pointer to fix the problem is highly appreciated.
Thanks,
Har Krishan
Just for the sake of others who may face the same issue: the problem seems to be with the configuration of the database-specific tables and sequences. I tried SQL Server, and the issue persisted with the default provided database scripts. I then tried the HSQL in-memory database and the issue persisted again. Then I tried the H2 in-memory database and it worked. It also works with MapJobRepositoryFactoryBean.
So you may need to tweak the DDL per database.
Thanks!
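For reference, overriding the isolation level used when the JobRepository creates executions (the tweak mentioned in the question) usually looks something like the sketch below, assuming Spring Batch 4.x and a DefaultBatchConfigurer subclass; it did not resolve the hang in this case, but it shows where that setting lives:

```java
import javax.sql.DataSource;
import org.springframework.batch.core.configuration.annotation.DefaultBatchConfigurer;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.repository.support.JobRepositoryFactoryBean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class JobRepositoryConfig extends DefaultBatchConfigurer {

    private final DataSource dataSource;

    public JobRepositoryConfig(DataSource dataSource) {
        super(dataSource);
        this.dataSource = dataSource;
    }

    @Override
    protected JobRepository createJobRepository() throws Exception {
        JobRepositoryFactoryBean factory = new JobRepositoryFactoryBean();
        factory.setDataSource(dataSource);
        factory.setTransactionManager(getTransactionManager());
        // The default ISOLATION_SERIALIZABLE can block when several steps
        // create their executions concurrently; relax it per your database.
        factory.setIsolationLevelForCreate("ISOLATION_READ_COMMITTED");
        factory.afterPropertiesSet();
        return factory.getObject();
    }
}
```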

How does Spring Batch manage transactions (with possibly multiple datasources)?

I would like some information about the data flow in a Spring Batch processing but fail to find what I am looking for on the Internet (despite some useful questions on this site).
I am trying to establish standards for using Spring Batch in our company, and we are wondering how Spring Batch behaves when several processors in a step update data on different data sources.
This question focuses on a chunked process but feel free to provide information on other modes.
From what I have seen (please correct me if I am wrong), when a line is read it follows the whole flow (reader, processors, writer) before the next one is read (as opposed to silo processing, where the reader would process all lines, send them to the processor, and so on).
In my case, several processors read data (in different databases) and update it in the process, and finally the writer inserts data into yet another DB. For now, the JobRepository is not linked to a database, but it would eventually be an independent one, making things still a bit more complex.
This model cannot be changed since the data belongs to several business areas.
How is the transaction managed in this case? Is the data committed only once the full chunk is processed? And is there then 2-phase commit management? How is it ensured? What development or configuration should be done in order to ensure data consistency?
More generally, what would your recommendations be in a similar case?
Spring Batch uses the Spring core transaction management, with most of the transaction semantics arranged around a chunk of items, as described in section 5.1 of the Spring Batch docs.
The transaction behaviour of the readers and writers depends on exactly what they are (e.g. file system, database, JMS queue, etc.), but if the resource is configured to support transactions then it will be enlisted by Spring automatically. The same goes for XA: if you make the resource endpoint XA compliant, then it will use two-phase commit for it.
Getting back to the chunk transaction: it is set up on a per-chunk basis, so if you set the commit interval to 5 on a given tasklet, it will open and close a new transaction (covering all resources managed by the transaction manager) for that number of reads (defined as the commit-interval).
But all of this is set up around reading from a single data source; does that meet your requirement? I'm not sure Spring Batch can manage a transaction where it reads data from multiple sources and writes the processor result into another database within a single transaction. (In fact I can't think of anything that could do that...)
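To make the per-chunk transaction concrete, here is a minimal sketch of a chunk-oriented step with a commit interval of 5 (the record types and bean names are placeholders); for a true 2-phase commit across resources, the transaction manager plugged in here would need to be a JTA/XA one:

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.transaction.PlatformTransactionManager;

@Configuration
public class ChunkStepConfig {

    // Placeholder item types standing in for the real domain objects.
    public static class SourceRecord {}
    public static class TargetRecord {}

    @Bean
    public Step chunkStep(StepBuilderFactory steps,
                          ItemReader<SourceRecord> reader,
                          ItemProcessor<SourceRecord, TargetRecord> processor,
                          ItemWriter<TargetRecord> writer,
                          PlatformTransactionManager transactionManager) {
        // Each chunk of 5 items is read, processed and written inside one transaction;
        // a failure rolls back the current chunk only, not previously committed chunks.
        return steps.get("chunkStep")
                .transactionManager(transactionManager) // a JtaTransactionManager would enable XA / 2PC
                .<SourceRecord, TargetRecord>chunk(5)   // commit interval = 5
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .build();
    }
}
```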

spring batch MapJobRepositoryFactoryBean

We get the below error when using Spring Batch:
org.springframework.dao.OptimisticLockingFailureException: Attempt to update step execution id=8827 with wrong version (1), where current version is 2
What I observed from different forums is that we are using org.springframework.batch.core.repository.support.MapJobRepositoryFactoryBean, which is not thread safe and is not advisable for production use.
We do not want to persist the job metadata or use an in-memory database. Is there any other alternative to MapJobRepositoryFactoryBean?
Thanks
Lives
According to this post on the Spring forums, MapJobRepositoryFactoryBean is not generally intended for production use. I guess I would ask why you wouldn't want the metadata persisted to a database? It provides tremendous value, not to mention giving you the ability to use the Spring Batch Admin console.

Connecting two datasources to the same database

I am working on an application where we have decided to go for a multi-tenant architecture using the solution provided by Spring, so we route the data to each datasource depending on the value of a parameter. Let's say this parameter is a number from 1 to 10, depending on our client's id.
However, this requires altering the application context each time we add a new datasource, so to start we have thought of the following solution:
Start with 10 datasources (or more) pointing to different IPs but the same schema, all in the end routed to the same physical database. No matter which datasource we use, the data will be sent to the same schema in this first scenario.
The data would be in the same schema, so the same table would be shared among datasources, but each row would only be visible to its datasource (using a fixed where clause in every CRUD operation).
When we have performance problems, we will create another database, migrate some clients to the new schema, and reroute the IP of one of the datasources to the new database, so the new database takes part of the load off the old one.
Are there any drawbacks with this approach? I am concerned about:
ACID properties lost
Problems with hibernate sessionFactory and second level cache
Table locking issues
We are using Spring 3.1, Hibernate 4.1 and MySQL 5.5
I think your Spring link is a little outdated; Hibernate 4 can handle multi-tenancy pretty well on its own. I would suggest using the multiple-schemas approach, because setting up and initializing a new schema is relatively easy to do programmatically (for example at registration time). If you have that much load, though (and your database vendor does not provide a solution to make this transparent to your application), you need the multiple-database approach; in that case you should try to incorporate the tenant id in the database URL or something similar. http://docs.jboss.org/hibernate/orm/4.1/devguide/en-US/html/ch16.html
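A minimal sketch of what the Hibernate schema-based multi-tenancy wiring might look like; the class name and the thread-local tenant holder are hypothetical, and you would still need a MultiTenantConnectionProvider implementation that switches the schema for each tenant:

```java
import org.hibernate.context.spi.CurrentTenantIdentifierResolver;

// Resolves the current tenant (e.g. the client id 1..10) for each Hibernate session.
public class ClientIdTenantResolver implements CurrentTenantIdentifierResolver {

    // Hypothetical ThreadLocal holder, populated per request (e.g. in a servlet filter).
    public static final ThreadLocal<String> CURRENT_TENANT = new ThreadLocal<>();

    @Override
    public String resolveCurrentTenantIdentifier() {
        String tenantId = CURRENT_TENANT.get();
        return tenantId != null ? tenantId : "1"; // fall back to a default tenant id
    }

    @Override
    public boolean validateExistingCurrentSessions() {
        return true;
    }
}
```

The resolver is registered via the hibernate.multiTenancy=SCHEMA and hibernate.tenant_identifier_resolver properties, together with a hibernate.multi_tenant_connection_provider that hands out connections pointed at the right schema, as described in the linked chapter of the Hibernate 4.1 developer guide.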
