I have a quite heavy Java webapp that serves thousands of requests/sec. It uses a master PostgreSQL db which replicates itself to one secondary (read-only) database using streaming (asynchronous) replication.
So, I route requests to either the primary or the secondary (read-only) database based on their URLs, so that read-only calls don't burden the primary database, given that the replication lag is minimal.
NOTE: I use one sessionFactory with a RoutingDataSource provided by Spring that looks up the db to use based on a key. I am interested in multitenancy as I am using Hibernate 4.3.4, which supports it.
I have two questions:
I don't think splitting on the basis of URLs is efficient, as I can
only move 10% of the traffic around, meaning there are not many read-only
URLs. What approach should I consider?
Maybe, somehow, on the basis of URLs I can achieve some level of
distribution among both nodes, but what would I do with my Quartz
jobs (which even run in a separate JVM)? What pragmatic approach should I
take?
I know I might not get a perfect answer here as this really is broad but I just want your opinion for the context.
Dudes I have in my team:
Spring4
Hibernate4
Quartz2.2
Java7 / Tomcat7
Please take interest. Thanks in advance.
Spring transaction routing
First, we will create a DataSourceType Java Enum that defines our transaction routing options:
public enum DataSourceType {
READ_WRITE,
READ_ONLY
}
To route the read-write transactions to the Primary node and read-only transactions to the Replica node, we can define a ReadWriteDataSource that connects to the Primary node and a ReadOnlyDataSource that connects to the Replica node.
The read-write and read-only transaction routing is done by the Spring AbstractRoutingDataSource abstraction, which is extended by the TransactionRoutingDataSource, as illustrated by the following diagram:
The TransactionRoutingDataSource is very easy to implement and looks as follows:
public class TransactionRoutingDataSource
        extends AbstractRoutingDataSource {

    @Nullable
    @Override
    protected Object determineCurrentLookupKey() {
        return TransactionSynchronizationManager
            .isCurrentTransactionReadOnly() ?
            DataSourceType.READ_ONLY :
            DataSourceType.READ_WRITE;
    }
}
Basically, we inspect the Spring TransactionSynchronizationManager class that stores the current transactional context to check whether the currently running Spring transaction is read-only or not.
The determineCurrentLookupKey method returns the discriminator value that will be used to choose either the read-write or the read-only JDBC DataSource.
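To see the lookup in isolation, here is a minimal Spring-free sketch of the same idea. The RoutingSketch class, its thread-local flag, and the "primary"/"replica" labels are stand-ins invented for illustration; in the real setup, Spring maintains the read-only flag and the map values are DataSource instances.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical, Spring-free illustration of key-based routing.
final class RoutingSketch {

    enum DataSourceType { READ_WRITE, READ_ONLY }

    // Stand-in for the per-thread read-only flag that Spring tracks.
    private static final ThreadLocal<Boolean> READ_ONLY =
        ThreadLocal.withInitial(() -> Boolean.FALSE);

    // Stand-in for the target DataSource map; values are plain labels here.
    private static final Map<DataSourceType, String> TARGETS =
        new EnumMap<>(DataSourceType.class);

    static {
        TARGETS.put(DataSourceType.READ_WRITE, "primary");
        TARGETS.put(DataSourceType.READ_ONLY, "replica");
    }

    // Same decision as in TransactionRoutingDataSource, based on the flag.
    static DataSourceType determineCurrentLookupKey() {
        return READ_ONLY.get()
            ? DataSourceType.READ_ONLY
            : DataSourceType.READ_WRITE;
    }

    // Sets the flag, resolves the target, then restores the previous flag.
    static String resolveTarget(boolean readOnly) {
        boolean previous = READ_ONLY.get();
        READ_ONLY.set(readOnly);
        try {
            return TARGETS.get(determineCurrentLookupKey());
        } finally {
            READ_ONLY.set(previous);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolveTarget(false));
        System.out.println(resolveTarget(true));
    }
}
```

The important detail is that the key is evaluated at the moment a connection is requested, so the flag has to be set before that happens.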
Spring read-write and read-only JDBC DataSource configuration
The DataSource configuration looks as follows:
@Configuration
@ComponentScan(
    basePackages = "com.vladmihalcea.book.hpjp.util.spring.routing"
)
@PropertySource(
    "/META-INF/jdbc-postgresql-replication.properties"
)
public class TransactionRoutingConfiguration
        extends AbstractJPAConfiguration {

    @Value("${jdbc.url.primary}")
    private String primaryUrl;

    @Value("${jdbc.url.replica}")
    private String replicaUrl;

    @Value("${jdbc.username}")
    private String username;

    @Value("${jdbc.password}")
    private String password;

    @Bean
    public DataSource readWriteDataSource() {
        PGSimpleDataSource dataSource = new PGSimpleDataSource();
        dataSource.setURL(primaryUrl);
        dataSource.setUser(username);
        dataSource.setPassword(password);
        return connectionPoolDataSource(dataSource);
    }

    @Bean
    public DataSource readOnlyDataSource() {
        PGSimpleDataSource dataSource = new PGSimpleDataSource();
        dataSource.setURL(replicaUrl);
        dataSource.setUser(username);
        dataSource.setPassword(password);
        return connectionPoolDataSource(dataSource);
    }

    @Bean
    public TransactionRoutingDataSource actualDataSource() {
        TransactionRoutingDataSource routingDataSource =
            new TransactionRoutingDataSource();

        Map<Object, Object> dataSourceMap = new HashMap<>();
        dataSourceMap.put(
            DataSourceType.READ_WRITE,
            readWriteDataSource()
        );
        dataSourceMap.put(
            DataSourceType.READ_ONLY,
            readOnlyDataSource()
        );

        routingDataSource.setTargetDataSources(dataSourceMap);
        return routingDataSource;
    }

    @Override
    protected Properties additionalProperties() {
        Properties properties = super.additionalProperties();
        properties.setProperty(
            "hibernate.connection.provider_disables_autocommit",
            Boolean.TRUE.toString()
        );
        return properties;
    }

    @Override
    protected String[] packagesToScan() {
        return new String[]{
            "com.vladmihalcea.book.hpjp.hibernate.transaction.forum"
        };
    }

    @Override
    protected String databaseType() {
        return Database.POSTGRESQL.name().toLowerCase();
    }

    protected HikariConfig hikariConfig(
            DataSource dataSource) {
        HikariConfig hikariConfig = new HikariConfig();
        int cpuCores = Runtime.getRuntime().availableProcessors();
        hikariConfig.setMaximumPoolSize(cpuCores * 4);
        hikariConfig.setDataSource(dataSource);
        hikariConfig.setAutoCommit(false);
        return hikariConfig;
    }

    protected HikariDataSource connectionPoolDataSource(
            DataSource dataSource) {
        return new HikariDataSource(hikariConfig(dataSource));
    }
}
The /META-INF/jdbc-postgresql-replication.properties resource file provides the configuration for the read-write and read-only JDBC DataSource components:
hibernate.dialect=org.hibernate.dialect.PostgreSQL10Dialect
jdbc.url.primary=jdbc:postgresql://localhost:5432/high_performance_java_persistence
jdbc.url.replica=jdbc:postgresql://localhost:5432/high_performance_java_persistence_replica
jdbc.username=postgres
jdbc.password=admin
The jdbc.url.primary property defines the URL of the Primary node while the jdbc.url.replica defines the URL of the Replica node.
The readWriteDataSource Spring component defines the read-write JDBC DataSource while the readOnlyDataSource component defines the read-only JDBC DataSource.
Note that both the read-write and read-only data sources use HikariCP for connection pooling.
The actualDataSource acts as a facade for the read-write and read-only data sources and is implemented using the TransactionRoutingDataSource utility.
The readWriteDataSource is registered using the DataSourceType.READ_WRITE key and the readOnlyDataSource using the DataSourceType.READ_ONLY key.
So, when executing a read-write @Transactional method, the readWriteDataSource will be used, while when executing a @Transactional(readOnly = true) method, the readOnlyDataSource will be used instead.
Note that the additionalProperties method defines the hibernate.connection.provider_disables_autocommit Hibernate property, which I added to Hibernate to postpone the database connection acquisition for RESOURCE_LOCAL JPA transactions.
Not only does hibernate.connection.provider_disables_autocommit allow you to make better use of database connections, but it's also the only way we can make this example work since, without this configuration, the connection is acquired prior to calling the determineCurrentLookupKey method of the TransactionRoutingDataSource.
The remaining Spring components needed for building the JPA EntityManagerFactory are defined by the AbstractJPAConfiguration base class.
Basically, the actualDataSource is further wrapped by DataSource-Proxy and provided to the JPA EntityManagerFactory. You can check the source code on GitHub for more details.
Testing time
To check if the transaction routing works, we are going to enable the PostgreSQL query log by setting the following properties in the postgresql.conf configuration file:
log_min_duration_statement = 0
log_line_prefix = '[%d] '
The log_min_duration_statement property setting is for logging all PostgreSQL statements while the second one adds the database name to the SQL log.
So, when calling the newPost and findAllPostsByTitle methods, like this:
Post post = forumService.newPost(
"High-Performance Java Persistence",
"JDBC", "JPA", "Hibernate"
);
List<Post> posts = forumService.findAllPostsByTitle(
"High-Performance Java Persistence"
);
We can see that PostgreSQL logs the following messages:
[high_performance_java_persistence] LOG: execute <unnamed>:
BEGIN
[high_performance_java_persistence] DETAIL:
parameters: $1 = 'JDBC', $2 = 'JPA', $3 = 'Hibernate'
[high_performance_java_persistence] LOG: execute <unnamed>:
select tag0_.id as id1_4_, tag0_.name as name2_4_
from tag tag0_ where tag0_.name in ($1 , $2 , $3)
[high_performance_java_persistence] LOG: execute <unnamed>:
select nextval ('hibernate_sequence')
[high_performance_java_persistence] DETAIL:
parameters: $1 = 'High-Performance Java Persistence', $2 = '4'
[high_performance_java_persistence] LOG: execute <unnamed>:
insert into post (title, id) values ($1, $2)
[high_performance_java_persistence] DETAIL:
parameters: $1 = '4', $2 = '1'
[high_performance_java_persistence] LOG: execute <unnamed>:
insert into post_tag (post_id, tag_id) values ($1, $2)
[high_performance_java_persistence] DETAIL:
parameters: $1 = '4', $2 = '2'
[high_performance_java_persistence] LOG: execute <unnamed>:
insert into post_tag (post_id, tag_id) values ($1, $2)
[high_performance_java_persistence] DETAIL:
parameters: $1 = '4', $2 = '3'
[high_performance_java_persistence] LOG: execute <unnamed>:
insert into post_tag (post_id, tag_id) values ($1, $2)
[high_performance_java_persistence] LOG: execute S_3:
COMMIT
[high_performance_java_persistence_replica] LOG: execute <unnamed>:
BEGIN
[high_performance_java_persistence_replica] DETAIL:
parameters: $1 = 'High-Performance Java Persistence'
[high_performance_java_persistence_replica] LOG: execute <unnamed>:
select post0_.id as id1_0_, post0_.title as title2_0_
from post post0_ where post0_.title=$1
[high_performance_java_persistence_replica] LOG: execute S_1:
COMMIT
The log statements using the high_performance_java_persistence prefix were executed on the Primary node while the ones using the high_performance_java_persistence_replica on the Replica node.
GitHub Repository
This is not just theory. It's all on GitHub and works like a charm. Use this test case as a reference.
So, you can use it as a starting point for your transaction routing solution, as you have a fully-functional example.
Second-level caching
Once you are using replication, you are operating in a distributed environment, so you need to use a distributed caching solution, like Infinispan.
Since we are using replication to distribute traffic to more database nodes, it's obvious that we also have multiple application nodes which have to connect to those database nodes.
Therefore, using the READ_WRITE CacheConcurrencyStrategy in such an environment is a terrible anti-pattern as each distributed node will keep its own copy of the cached entries, leading you to consistency issues even if you didn't use transaction routing.
Not to mention the cold cache issue you'd face if you employed auto-scaling for your application nodes, as they would amplify the database traffic because new nodes would start with a cold cache.
So, if you plan to use transaction routing with the second-level cache mechanism, then you can do better than this.
Use the NONSTRICT_READ_WRITE cache concurrency strategy with a second-level caching provider that can store the cached data in a distributed system of nodes that are readily available even when you create new application nodes.
Conclusion
You need to make sure you set the right size for your connection pools because that can make a huge difference. For this, I recommend using Flexy Pool.
You need to be very diligent and make sure you mark all read-only transactions accordingly. It's unusual that only 10% of your transactions are read-only. Could it be that you have such a write-mostly application, or are you using write transactions where you only issue query statements?
For batch processing, you definitely need read-write transactions, so make sure you enable JDBC batching, like this:
<property name="hibernate.order_updates" value="true"/>
<property name="hibernate.order_inserts" value="true"/>
<property name="hibernate.jdbc.batch_size" value="25"/>
For batching you can also use a separate DataSource that uses a different connection pool that connects to the Primary node.
Just make sure your total connection size of all connection pools is less than the number of connections PostgreSQL has been configured with.
Each batch job must use a dedicated transaction, so make sure you use a reasonable batch size.
Moreover, you want to hold locks for as short a time as possible and to finish transactions as fast as possible. If the batch processor is using concurrent processing workers, make sure the associated connection pool size is equal to the number of workers, so they don't wait for others to release connections.
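As a rough sanity check for the pool sizing advice above, the arithmetic can be sketched like this. The node count, core count, worker count, and the reserved-connection margin are all assumptions for illustration; the PoolBudgetSketch class is hypothetical and not part of the example project.

```java
// Hypothetical helper for checking a connection budget against PostgreSQL.
final class PoolBudgetSketch {

    // Total connections opened against one database node when every
    // application node runs the given pools at their maximum size.
    static int totalConnections(int appNodes, int... poolSizesPerNode) {
        int perNode = 0;
        for (int size : poolSizesPerNode) {
            perNode += size;
        }
        return appNodes * perNode;
    }

    // True when the total stays within max_connections minus a reserved margin.
    static boolean fitsWithin(int maxConnections, int reserved, int total) {
        return total <= maxConnections - reserved;
    }

    public static void main(String[] args) {
        int webPool = 4 * 4;   // cpuCores * 4, assuming 4 cores per node
        int batchPool = 8;     // one connection per batch worker, assuming 8 workers
        int total = totalConnections(3, webPool, batchPool); // assuming 3 app nodes
        // 100 and 3 are assumed values for max_connections and the reserved margin.
        System.out.println(total + " connections, fits: " + fitsWithin(100, 3, total));
    }
}
```

If the check fails, shrink the pools rather than raising max_connections blindly, since each PostgreSQL connection has its own cost on the server.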
You are saying that only 10% of your application URLs are read-only, so the other 90% involve at least some form of database writing.
10% READ
You can think about using a CQRS design that may improve your database read performance. It can certainly read from the secondary database, and possibly be made more efficient by designing the queries and domain models specifically for the read/view layer.
You haven't said whether the 10% requests are expensive or not (e.g. running reports)
I would prefer to use a separate sessionFactory if you were to follow the CQRS design as the objects being loaded/cached will most likely be different to those being written.
90% WRITE
As far as the other 90% go, you wouldn't want to read from the secondary database (while writing to the primary) during some write logic as you will not want potentially stale data involved.
Some of these reads are likely to be looking up "static" data. If Hibernate's caching is not reducing database hits for reads, I would consider an in memory cache like Memcached or Redis for this type of data. This same cache could be used by both 10%-Read and 90%-write processes.
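To illustrate the read-through idea for "static" data, here is a minimal in-process sketch. It uses a ConcurrentHashMap as a stand-in for Memcached or Redis, and the StaticDataCacheSketch class and its loader function are hypothetical names, not an existing API.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical read-through cache; the loader stands in for a database query.
final class StaticDataCacheSketch<K, V> {

    private final ConcurrentMap<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;
    private final AtomicInteger loads = new AtomicInteger();

    StaticDataCacheSketch(Function<K, V> loader) {
        this.loader = loader;
    }

    // Returns the cached value, calling the loader only on a miss.
    V get(K key) {
        return cache.computeIfAbsent(key, k -> {
            loads.incrementAndGet();
            return loader.apply(k);
        });
    }

    // Drops an entry so the next read goes back to the loader.
    void evict(K key) {
        cache.remove(key);
    }

    // Number of times the loader (i.e. the database) was actually hit.
    int loadCount() {
        return loads.get();
    }
}
```

With a shared external cache, the same get/evict shape applies, and the load counter gives you a crude hit/miss measurement.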
For reads that are not static (i.e. reading data you have recently written), Hibernate should hold the data in its object cache if it's sized appropriately. Can you determine your cache hit/miss performance?
QUARTZ
If you know for sure that a scheduled job won't impact the same set of data as another job, you could run them against different databases, however if in doubt always perform batch updates to one (primary) server and replicate changes out. It is better to be logically correct, than to introduce replication issues.
DB PARTITIONING
If your 1,000 requests per second are writing a lot of data, look at partitioning your database. You may find you have ever growing tables. Partitioning is one way to address this without archiving data.
Sometimes you need little or no change to your application code.
Archiving is obviously another option
Disclaimer: Any question like this is always going to be application specific. Always try to keep your architecture as simple as possible.
Since the replication is async, the accepted solution will cause hard-to-debug and hard-to-reproduce bugs with the second-level cache. This is demonstrated here.
This automated test shows that it can lead to manipulating incomplete entity graphs.
The cleanest path is to have one EntityManagerFactory per DataSource.
If I understand correctly, 90% of the HTTP requests to your webapp involve at least one write and have to operate on the master database. You can direct read-only transactions to the copy database, but the improvement will only affect 10% of the overall database operations, and even those read-only operations will hit a database.
The common architecture here is to use a good database cache (Infinispan or Ehcache). If you can offer a big enough cache, you can hope that a good part of the database reads only hit the cache and become memory-only operations, whether they are part of a read-only transaction or not. Cache tuning is a delicate operation, but IMHO it is necessary to achieve a high performance gain. Those caches even allow for distributed front ends, even if the configuration is a bit harder in that case (you might have to look at Terracotta clusters if you want to use Ehcache).
Currently, database replication is mainly used to secure the data, and is used as a concurrency improvement mechanism only if large parts of the information system only read data, and that is not what you are describing.
You can also run ProxySQL in front of your DB nodes (which can be a Galera cluster setup) and set query read/write split rules; the proxy will distribute traffic according to the defined rules. For example, SELECT queries are routed to the read node, whereas UPDATE queries or read-write transactions go to the write node.
I think the question is general, so I'm not sure why the preferred answer steers it into Spring internals. Anyway, you may want to take a look at Apache ShardingSphere, which has this feature:
Read/write Splitting
---------------------
Read/write splitting can be used to cope with business access with high stress. ShardingSphere provides flexible read/write splitting capabilities and can achieve read access load balancing based on the understanding of SQL semantics and the ability to perceive the underlying database topology.
One thing I am concerned about is the "understanding of SQL semantics" claim, because how would any library "understand" whether select myfunct(1) from dual changes data inside the function or not?
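To make the concern concrete, here is a deliberately naive statement-based splitter. It is not how ShardingSphere or ProxySQL decide; the NaiveReadWriteSplitSketch class and its prefix rule are assumptions made up purely for illustration.

```java
import java.util.Locale;

// Hypothetical, deliberately naive router that only looks at the SQL text.
final class NaiveReadWriteSplitSketch {

    enum Target { PRIMARY, REPLICA }

    // Treats anything starting with SELECT (and not locking rows) as a read.
    static Target route(String sql) {
        String head = sql.trim().toLowerCase(Locale.ROOT);
        boolean looksLikeRead = head.startsWith("select")
            && !head.contains(" for update");
        return looksLikeRead ? Target.REPLICA : Target.PRIMARY;
    }

    public static void main(String[] args) {
        System.out.println(route("UPDATE account SET active = false"));
        // A function call hidden behind SELECT is still classified as a read,
        // even if myfunct writes data.
        System.out.println(route("select myfunct(1) from dual"));
    }
}
```

Any rule that works from the statement text alone has this blind spot, which is why routing on an explicit read-only transaction flag is easier to reason about.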
I am working on a desktop application built using the Spring framework, and one part of the application is not working. I found that the repository class does not have any queries with the @Query annotation. I haven't encountered this before.
When I try to open the form that uses this, I get an error that the application is not able to connect to the database. The application has 3 databases specified in the application.properties. I have the following questions:
1) How does the following code work without a query specified with the @Query annotation? Or where is the query written?
@Repository
public interface AccountRepository extends JpaRepository<Account, Long> {

    List<Account> findAccountsByActiveIsTrueAndAccountTypeEquals(String accountType);

    List<Account> findAccountsByAccountTypeLike(String type);
}
2) How do we specify which of the databases to search? For example: I have 3 MySQL databases currently connected to my application. I wish to access data from DB1 through my Spring Boot application through the usual flow of
UI model -> BE Controller/Service layer -> Repository (interface), which (usually) has the query written with @Query. How do we specify which database this query goes to?
For your first question, I can answer that the JpaRepository has an internal mechanism that analyses the method name you have written and then generates the query that has to be executed against the database.
The @Query annotation is used when the query derived from the method name does not return the result you wanted, so you explicitly tell Spring Data which query should be executed.
As mentioned here: https://docs.spring.io/spring-data/jpa/docs/1.5.0.RELEASE/reference/html/jpa.repositories.html
2.3.1 Query lookup strategies.
The JPA module supports defining a query manually as String or have it being derived from the method name.
Declared queries
Although getting a query derived from the method name is quite convenient, one might face the situation in which either the method name parser does not support the keyword one wants to use or the method name would get unnecessarily ugly. So you can either use JPA named queries through a naming convention (see Section 2.3.3, “Using JPA NamedQueries” for more information) or rather annotate your query method with @Query (see Section 2.3.4, “Using @Query” for details).
So basically using a naming convention will do the magic.
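To give a feel for how a method name can be turned into a query, here is a toy parser. It is only an illustration of the naming convention, not Spring Data's actual implementation: the DerivedQuerySketch class is hypothetical, it handles just the IsTrue, Like, and Equals keywords joined by And, and it breaks on property names that themselves contain "And".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy parser for a tiny subset of derived-query method names.
final class DerivedQuerySketch {

    static String toJpql(String entity, String methodName) {
        int by = methodName.indexOf("By");
        if (by < 0) {
            throw new IllegalArgumentException("No 'By' in " + methodName);
        }
        // Everything after "By" is a list of conditions joined by "And".
        String[] parts = methodName.substring(by + 2).split("And");
        List<String> predicates = new ArrayList<>();
        int param = 1;
        for (String part : parts) {
            if (part.endsWith("IsTrue")) {
                predicates.add("a." + property(part, "IsTrue") + " = true");
            } else if (part.endsWith("Like")) {
                predicates.add("a." + property(part, "Like") + " like ?" + param++);
            } else if (part.endsWith("Equals")) {
                predicates.add("a." + property(part, "Equals") + " = ?" + param++);
            } else {
                predicates.add("a." + property(part, "") + " = ?" + param++);
            }
        }
        return "select a from " + entity + " a where "
            + String.join(" and ", predicates);
    }

    // Strips the keyword suffix and lower-cases the first letter.
    private static String property(String part, String keyword) {
        String name = part.substring(0, part.length() - keyword.length());
        return Character.toLowerCase(name.charAt(0)) + name.substring(1);
    }

    public static void main(String[] args) {
        System.out.println(toJpql("Account",
            "findAccountsByActiveIsTrueAndAccountTypeEquals"));
    }
}
```

Each method argument is bound, in order, to the positional parameters produced for the conditions that need a value.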
Also an interesting question and perfect answer can be found here:
How are Spring Data repositories actually implemented?
For your second question you can refer to this example:
https://www.baeldung.com/spring-data-jpa-multiple-databases
It might be a bit complicated in the beginning but eventually it will work.
The code uses JPA; JpaRepository already provides the CRUD methods.
https://docs.spring.io/spring-data/jpa/docs/current/reference/html/#reference
In your application.properties, you can put your mysql DB info
Why does this work without @Query?
Because you are using JpaRepository, which provides an easy way to get data based on your entity and its fields.
Here your Account will have active, accountType, etc. fields. You can use JPA's query creation keywords such as And, Or, Equals, Like and many more.
For derived queries with the predicates IsStartingWith, StartingWith, StartsWith, IsEndingWith, EndingWith, EndsWith, IsNotContaining, NotContaining, NotContains, IsContaining, Containing, Contains, the respective arguments will get sanitized. This means that if the arguments actually contain characters recognized by LIKE as wildcards, these will get escaped so they match only as literals. The escape character used can be configured by setting the escapeCharacter of the @EnableJpaRepositories annotation.
How do we specify which of the databases to search?
You can create configuration classes based on your databases and define data sources based on that using @PropertySource.
For more details see example here
@Configuration
@PropertySource({ "classpath:persistence-multiple-db.properties" })
@EnableJpaRepositories(
    basePackages = "com.baeldung.multipledb.dao.product",
    entityManagerFactoryRef = "productEntityManager",
    transactionManagerRef = "productTransactionManager"
)
I am new to Java and started with Spring Boot and Spring Data JPA, so I know 2 ways on how to fetch data:
by Repository layer, with literal method naming: findOneByCity(String city);
by custom repo, with the @Query annotation: @Query("select * from table where city like ?")
Both ways are statically designed.
What should I do to get the data for a query that I have to build at run time?
What I am trying to achieve is the possibility to create dynamic reports without touching the code. A table would have records of reports with names and SQL queries with default parameters like begin_date, end_date etc, but with a variety of bodies. Example:
"Sales report by payment method" | select * from sales where met_pay = %pay_method% and date between %begin_date% and %end_date%;
The Criteria API is mainly designed for that.
It provides an alternative way to define JPA queries.
With it you could build dynamic queries according to data provided at runtime.
To use it, you will need to create a custom repository implementation and not only an interface.
You will indeed need to inject an EntityManager to create the objects needed to build and execute the CriteriaQuery.
You will of course have to write boilerplate code to build the query and execute it.
This section explains how to create a custom repository with Spring Boot.
About your edit :
What I am trying to achieve is the possibility to create dynamic
reports without touching the code. A table would have records of
reports with names and SQl queries with default parameters like
begin_date, end_date etc, but with a variety of bodies.
If the queries are written by hand in a plain text file, Criteria will not be the best choice, as a JPQL/SQL query and a Criteria query are really not written in the same way.
In the Java code, mapping the JPQL/SQL queries defined in a plain text file to a Map<String, String> structure would be better suited.
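As a rough sketch of that Map-based approach, the %name% placeholders from the question could be rewritten to named parameters so that values are bound rather than concatenated. The ReportQuerySketch class and its method names are hypothetical, and the regex would also match a LIKE pattern such as '%foo%', so treat it as a starting point only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that prepares stored report queries for parameter binding.
final class ReportQuerySketch {

    // Matches placeholders written as %name% in the stored query text.
    private static final Pattern PLACEHOLDER = Pattern.compile("%(\\w+)%");

    // Rewrites %name% to :name so the value can be bound as a named parameter.
    static String toNamedParameters(String template) {
        return PLACEHOLDER.matcher(template).replaceAll(":$1");
    }

    // Lists the distinct placeholder names in order of first appearance.
    static List<String> parameterNames(String template) {
        List<String> names = new ArrayList<>();
        Matcher matcher = PLACEHOLDER.matcher(template);
        while (matcher.find()) {
            if (!names.contains(matcher.group(1))) {
                names.add(matcher.group(1));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Report name -> query text, as it might be loaded from a file or table.
        Map<String, String> reports = new LinkedHashMap<>();
        reports.put("Sales report by payment method",
            "select * from sales where met_pay = %pay_method% "
                + "and date between %begin_date% and %end_date%");
        for (Map.Entry<String, String> report : reports.entrySet()) {
            System.out.println(report.getKey());
            System.out.println(toNamedParameters(report.getValue()));
            System.out.println(parameterNames(report.getValue()));
        }
    }
}
```

The rewritten text can then be handed to EntityManager.createNativeQuery, with each listed name bound through setParameter.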
But I have some doubts about the feasibility of what you want to do.
Queries may have specific parameters, and in some cases you would have no other choice than modifying the code. Specificities in parameters will make query maintenance very hard and error-prone. Personally, I would implement the requirement by allowing the client to select, for each field, whether a condition should be applied.
Then, on the implementation side, I would use this user information to build my CriteriaQuery.
And there Criteria will do an excellent job: less code duplication, more adaptability for the query building and, in addition, more type checks at compile time.
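To show the "apply a condition only if the client selected it" idea in isolation, here is a plain-Java stand-in that builds a JPQL string instead of using the Criteria API. The OptionalFilterSketch class is hypothetical; with CriteriaBuilder you would collect Predicate objects in the same conditional way. Field names are assumed to come from your own code, never from user input.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical builder that only adds the conditions the client asked for.
final class OptionalFilterSketch {

    private final List<String> predicates = new ArrayList<>();
    private final Map<String, Object> parameters = new LinkedHashMap<>();

    // Adds "field = :field" only when a value was supplied.
    OptionalFilterSketch whereEquals(String field, Object value) {
        if (value != null) {
            predicates.add("s." + field + " = :" + field);
            parameters.put(field, value);
        }
        return this;
    }

    // Produces the query text; the WHERE clause is omitted when no filter applies.
    String jpql(String entity) {
        String base = "select s from " + entity + " s";
        return predicates.isEmpty()
            ? base
            : base + " where " + String.join(" and ", predicates);
    }

    // Values to bind, keyed by parameter name.
    Map<String, Object> parameters() {
        return parameters;
    }
}
```

For instance, supplying a payment method but leaving the region empty yields a query with a single condition and a single bound parameter.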
Spring-data repositories use EntityManager beneath. Repository classes are just another layer for the user not to worry about the details. But if a user wants to get his hands dirty, then of course spring wouldn't mind.
That is when you can use EntityManager directly.
Let us assume you have a Repository Class like AbcRepository
interface AbcRepository extends JpaRepository<Abc, String> {
}
You can create a custom repository like
interface CustomizedAbcRepository {
void someCustomMethod(User user);
}
The implementation class looks like
class CustomizedAbcRepositoryImpl implements CustomizedAbcRepository {

    @Autowired
    EntityManager entityManager;

    public void someCustomMethod(User user) {
        // You can build your custom query using Criteria or CriteriaBuilder
        // and then use that in entityManager methods
    }
}
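Just a word of caution: the naming of the customized interface and the customized implementing class is very important. By default, Spring Data picks up the implementation by its Impl suffix, and AbcRepository also has to extend CustomizedAbcRepository for the custom method to be exposed.
Recent versions of Spring Data added the ability to use the JPA Criteria API. For more information, see the blog post https://jverhoelen.github.io/spring-data-queries-jpa-criteria-api/ .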
I am running a complex query via the Hibernate Criteria API. During debugging, I would like to be able to extract and log the parameters that have been bound to the criteria object. Using Hibernate's org.hibernate.type logger is not an option because during the server start there are many, many queries that are run and the logger causes a serious performance hit, and as we are using Hibernate 3.5, it cannot be configured to be turned on before and after the specific method call, only when the server starts.
As for getting the SQL query itself, in this answer someone posted excellent code that allows for extracting the SQL from a criteria. Is there a similar solution for the bound parameters?
You can log the Criteria and the Restrictions will be displayed as well:
Criteria criteria = session.createCriteria(Post.class)
.add(Restrictions.eq("title", "post"));
LOGGER.info("Criteria: {}", criteria);
will display:
Criteria: CriteriaImpl(com.vladmihalcea.book.hpjp.hibernate.association.AllAssociationTest$Post:this[][title=post])
I have a quite heavy java webapp that serves thousands of requests/sec and it uses a master Postgresql db which replicates itself to one secondary (read-only) database using streaming (asynchronous) replication.
So, I separate the request from primary to secondary(read-only) using URLs to avoid read-only calls to bug primary database considering replication time is minimal.
NOTE: I use one sessionFactory with a RoutingDataSource provided by spring that looks up db to use based on a key. I am interested in multitenancy as I am using hibernate 4.3.4 that supports it.
I have two questions:
I dont think splitting on the basis of URLs is efficient as I can
only move 10% of traffic around means there are not many read-only
URLs. What approach should I consider?
May be,somehow, on the basis of URLs I achieve some level of
distribution among both nodes but what would I do with my quartz
jobs(that even have separate JVM)? What pragmatic approach should I
take?
I know I might not get a perfect answer here as this really is broad but I just want your opinion for the context.
Dudes I have in my team:
Spring4
Hibernate4
Quartz2.2
Java7 / Tomcat7
Please take interest. Thanks in advance.
Spring transaction routing
First, we will create a DataSourceType Java Enum that defines our transaction routing options:
public enum DataSourceType {
READ_WRITE,
READ_ONLY
}
To route the read-write transactions to the Primary node and read-only transactions to the Replica node, we can define a ReadWriteDataSource that connects to the Primary node and a ReadOnlyDataSource that connect to the Replica node.
The read-write and read-only transaction routing is done by the Spring AbstractRoutingDataSource abstraction, which is implemented by the TransactionRoutingDatasource, as illustrated by the following diagram:
The TransactionRoutingDataSource is very easy to implement and looks as follows:
public class TransactionRoutingDataSource
extends AbstractRoutingDataSource {
#Nullable
#Override
protected Object determineCurrentLookupKey() {
return TransactionSynchronizationManager
.isCurrentTransactionReadOnly() ?
DataSourceType.READ_ONLY :
DataSourceType.READ_WRITE;
}
}
Basically, we inspect the Spring TransactionSynchronizationManager class that stores the current transactional context to check whether the currently running Spring transaction is read-only or not.
The determineCurrentLookupKey method returns the discriminator value that will be used to choose either the read-write or the read-only JDBC DataSource.
Spring read-write and read-only JDBC DataSource configuration
The DataSource configuration looks as follows:
#Configuration
#ComponentScan(
basePackages = "com.vladmihalcea.book.hpjp.util.spring.routing"
)
#PropertySource(
"/META-INF/jdbc-postgresql-replication.properties"
)
public class TransactionRoutingConfiguration
extends AbstractJPAConfiguration {
#Value("${jdbc.url.primary}")
private String primaryUrl;
#Value("${jdbc.url.replica}")
private String replicaUrl;
#Value("${jdbc.username}")
private String username;
#Value("${jdbc.password}")
private String password;
#Bean
public DataSource readWriteDataSource() {
PGSimpleDataSource dataSource = new PGSimpleDataSource();
dataSource.setURL(primaryUrl);
dataSource.setUser(username);
dataSource.setPassword(password);
return connectionPoolDataSource(dataSource);
}
#Bean
public DataSource readOnlyDataSource() {
PGSimpleDataSource dataSource = new PGSimpleDataSource();
dataSource.setURL(replicaUrl);
dataSource.setUser(username);
dataSource.setPassword(password);
return connectionPoolDataSource(dataSource);
}
#Bean
public TransactionRoutingDataSource actualDataSource() {
TransactionRoutingDataSource routingDataSource =
new TransactionRoutingDataSource();
Map<Object, Object> dataSourceMap = new HashMap<>();
dataSourceMap.put(
DataSourceType.READ_WRITE,
readWriteDataSource()
);
dataSourceMap.put(
DataSourceType.READ_ONLY,
readOnlyDataSource()
);
routingDataSource.setTargetDataSources(dataSourceMap);
return routingDataSource;
}
#Override
protected Properties additionalProperties() {
Properties properties = super.additionalProperties();
properties.setProperty(
"hibernate.connection.provider_disables_autocommit",
Boolean.TRUE.toString()
);
return properties;
}
#Override
protected String[] packagesToScan() {
return new String[]{
"com.vladmihalcea.book.hpjp.hibernate.transaction.forum"
};
}
#Override
protected String databaseType() {
return Database.POSTGRESQL.name().toLowerCase();
}
protected HikariConfig hikariConfig(
DataSource dataSource) {
HikariConfig hikariConfig = new HikariConfig();
int cpuCores = Runtime.getRuntime().availableProcessors();
hikariConfig.setMaximumPoolSize(cpuCores * 4);
hikariConfig.setDataSource(dataSource);
hikariConfig.setAutoCommit(false);
return hikariConfig;
}
protected HikariDataSource connectionPoolDataSource(
DataSource dataSource) {
return new HikariDataSource(hikariConfig(dataSource));
}
}
The /META-INF/jdbc-postgresql-replication.properties resource file provides the configuration for the read-write and read-only JDBC DataSource components:
hibernate.dialect=org.hibernate.dialect.PostgreSQL10Dialect
jdbc.url.primary=jdbc:postgresql://localhost:5432/high_performance_java_persistence
jdbc.url.replica=jdbc:postgresql://localhost:5432/high_performance_java_persistence_replica
jdbc.username=postgres
jdbc.password=admin
The jdbc.url.primary property defines the URL of the Primary node while the jdbc.url.replica defines the URL of the Replica node.
The readWriteDataSource Spring component defines the read-write JDBC DataSource while the readOnlyDataSource component define the read-only JDBC DataSource.
Note that both the read-write and read-only data sources use HikariCP for connection pooling.
The actualDataSource acts as a facade for the read-write and read-only data sources and is implemented using the TransactionRoutingDataSource utility.
The readWriteDataSource is registered using the DataSourceType.READ_WRITE key and the readOnlyDataSource using the DataSourceType.READ_ONLY key.
So, when executing a read-write @Transactional method, the readWriteDataSource is used, while a @Transactional(readOnly = true) method uses the readOnlyDataSource instead.
Note that the additionalProperties method defines the hibernate.connection.provider_disables_autocommit Hibernate property, which I added to Hibernate to postpone the database connection acquisition for RESOURCE_LOCAL JPA transactions.
Not only does hibernate.connection.provider_disables_autocommit allow you to make better use of database connections, but it's also the only way to make this example work since, without this configuration, the connection is acquired prior to calling the determineCurrentLookupKey method of the TransactionRoutingDataSource.
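To see why the order of operations matters, here is a minimal plain-Java simulation. It does not use Spring or Hibernate: the READ_ONLY_FLAG thread-local is a hypothetical stand-in for the per-thread read-only marker, and the class and method names are made up for illustration. It only shows that a lookup key resolved before the flag is set always points at the read-write side.

```java
public class LazyRoutingSketch {

    enum DataSourceType { READ_WRITE, READ_ONLY }

    // Hypothetical stand-in for the per-thread "current transaction is read-only" marker.
    static final ThreadLocal<Boolean> READ_ONLY_FLAG =
        ThreadLocal.withInitial(() -> false);

    // Same idea as determineCurrentLookupKey(): pick the key from the current flag.
    static DataSourceType lookupKey() {
        return READ_ONLY_FLAG.get()
            ? DataSourceType.READ_ONLY
            : DataSourceType.READ_WRITE;
    }

    // Eager acquisition: the key is resolved BEFORE the read-only flag is set.
    static DataSourceType eagerAcquisition(boolean readOnlyTransaction) {
        DataSourceType key = lookupKey();
        READ_ONLY_FLAG.set(readOnlyTransaction);
        try {
            return key;
        } finally {
            READ_ONLY_FLAG.remove();
        }
    }

    // Delayed acquisition: the key is resolved AFTER the read-only flag is set.
    static DataSourceType delayedAcquisition(boolean readOnlyTransaction) {
        READ_ONLY_FLAG.set(readOnlyTransaction);
        try {
            return lookupKey();
        } finally {
            READ_ONLY_FLAG.remove();
        }
    }

    public static void main(String[] args) {
        System.out.println(eagerAcquisition(true));   // READ_WRITE (wrong node)
        System.out.println(delayedAcquisition(true)); // READ_ONLY
    }
}
```

In this simplified model, delaying the connection acquisition is what lets a read-only transaction actually reach the Replica node.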
The remaining Spring components needed for building the JPA EntityManagerFactory are defined by the AbstractJPAConfiguration base class.
Basically, the actualDataSource is further wrapped by DataSource-Proxy and provided to the JPA EntityManagerFactory. You can check the source code on GitHub for more details.
Testing time
To check if the transaction routing works, we are going to enable the PostgreSQL query log by setting the following properties in the postgresql.conf configuration file:
log_min_duration_statement = 0
log_line_prefix = '[%d] '
The log_min_duration_statement setting logs all PostgreSQL statements, while the log_line_prefix setting adds the database name to the SQL log.
So, when calling the newPost and findAllPostsByTitle methods, like this:
Post post = forumService.newPost(
"High-Performance Java Persistence",
"JDBC", "JPA", "Hibernate"
);
List<Post> posts = forumService.findAllPostsByTitle(
"High-Performance Java Persistence"
);
We can see that PostgreSQL logs the following messages:
[high_performance_java_persistence] LOG: execute <unnamed>:
BEGIN
[high_performance_java_persistence] DETAIL:
parameters: $1 = 'JDBC', $2 = 'JPA', $3 = 'Hibernate'
[high_performance_java_persistence] LOG: execute <unnamed>:
select tag0_.id as id1_4_, tag0_.name as name2_4_
from tag tag0_ where tag0_.name in ($1 , $2 , $3)
[high_performance_java_persistence] LOG: execute <unnamed>:
select nextval ('hibernate_sequence')
[high_performance_java_persistence] DETAIL:
parameters: $1 = 'High-Performance Java Persistence', $2 = '4'
[high_performance_java_persistence] LOG: execute <unnamed>:
insert into post (title, id) values ($1, $2)
[high_performance_java_persistence] DETAIL:
parameters: $1 = '4', $2 = '1'
[high_performance_java_persistence] LOG: execute <unnamed>:
insert into post_tag (post_id, tag_id) values ($1, $2)
[high_performance_java_persistence] DETAIL:
parameters: $1 = '4', $2 = '2'
[high_performance_java_persistence] LOG: execute <unnamed>:
insert into post_tag (post_id, tag_id) values ($1, $2)
[high_performance_java_persistence] DETAIL:
parameters: $1 = '4', $2 = '3'
[high_performance_java_persistence] LOG: execute <unnamed>:
insert into post_tag (post_id, tag_id) values ($1, $2)
[high_performance_java_persistence] LOG: execute S_3:
COMMIT
[high_performance_java_persistence_replica] LOG: execute <unnamed>:
BEGIN
[high_performance_java_persistence_replica] DETAIL:
parameters: $1 = 'High-Performance Java Persistence'
[high_performance_java_persistence_replica] LOG: execute <unnamed>:
select post0_.id as id1_0_, post0_.title as title2_0_
from post post0_ where post0_.title=$1
[high_performance_java_persistence_replica] LOG: execute S_1:
COMMIT
The log statements using the high_performance_java_persistence prefix were executed on the Primary node, while the ones using the high_performance_java_persistence_replica prefix were executed on the Replica node.
GitHub Repository
This is not just theory. It's all on GitHub and works like a charm. Use this test case as a reference.
So, you can use it as a starting point for your transaction routing solution, as you have a fully functional example.
Second-level caching
Once you are using replication, you are operating in a distributed environment, so you need to use a distributed caching solution, like Infinispan.
Since we are using replication to distribute traffic to more database nodes, it's obvious that we also have multiple application nodes which have to connect to those database nodes.
Therefore, using the READ_WRITE CacheConcurrencyStrategy in such an environment is a terrible anti-pattern as each distributed node will keep its own copy of the cached entries, leading you to consistency issues even if you didn't use transaction routing.
Not to mention the cold cache issue you'd face if you employed auto-scaling for your application nodes, as they would amplify the database traffic because new nodes would start with a cold cache.
So, if you plan to use transaction routing with the second-level cache mechanism, then you can do better than this.
Use the NONSTRICT_READ_WRITE cache concurrency strategy with a second-level caching provider that can store the cached data in a distributed system of nodes that are readily available even when you create new application nodes.
Conclusion
You need to make sure you set the right size for your connection pools because that can make a huge difference. For this, I recommend using Flexy Pool.
You need to be very diligent and make sure you mark all read-only transactions accordingly. It's unusual that only 10% of your transactions are read-only. Could it be that you have a write-heavy application, or that you are using read-write transactions where you only issue query statements?
For batch processing, you definitely need read-write transactions, so make sure you enable JDBC batching, like this:
<property name="hibernate.order_updates" value="true"/>
<property name="hibernate.order_inserts" value="true"/>
<property name="hibernate.jdbc.batch_size" value="25"/>
For batching you can also use a separate DataSource that uses a different connection pool that connects to the Primary node.
Just make sure your total connection size of all connection pools is less than the number of connections PostgreSQL has been configured with.
Each batch job must use a dedicated transaction, so make sure you use a reasonable batch size.
Moreover, you want to hold locks for as short a time as possible and finish transactions as fast as possible. If the batch processor is using concurrent processing workers, make sure the associated connection pool size is equal to the number of workers, so they don't wait for others to release connections.
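The connection budget described above is simple arithmetic, and a small check like the following can make it explicit. This is only a sketch: the pool names and sizes are made up, the batch pool is sized to an assumed 8 workers, and subtracting the connections PostgreSQL reserves for superusers is an extra safety margin that the text above does not require.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ConnectionBudgetSketch {

    // Sum of the maximum sizes of all pools that connect to the same PostgreSQL node.
    static int totalPoolSize(Map<String, Integer> maxPoolSizes) {
        int total = 0;
        for (int size : maxPoolSizes.values()) {
            total += size;
        }
        return total;
    }

    // True when all pools together stay below the connections left for the application.
    static boolean fitsBudget(Map<String, Integer> maxPoolSizes,
                              int maxConnections,
                              int reservedConnections) {
        return totalPoolSize(maxPoolSizes) < maxConnections - reservedConnections;
    }

    public static void main(String[] args) {
        // Hypothetical pools that both connect to the Primary node.
        Map<String, Integer> primaryPools = new LinkedHashMap<>();
        primaryPools.put("webReadWritePool", 32);
        primaryPools.put("batchPool", 8); // equal to the assumed number of batch workers

        System.out.println(totalPoolSize(primaryPools));      // 40
        System.out.println(fitsBudget(primaryPools, 100, 3)); // true
    }
}
```

If several application nodes connect to the same database node, the same check has to cover the pools of all of them.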
You are saying that only 10% of your application URLs are read-only, so the other 90% involve at least some form of database writing.
10% READ
You can think about using a CQRS design that may improve your database read performance. It can certainly read from the secondary database, and possibly be made more efficient by designing the queries and domain models specifically for the read/view layer.
You haven't said whether the 10% of requests are expensive or not (e.g. running reports).
I would prefer to use a separate sessionFactory if you were to follow the CQRS design as the objects being loaded/cached will most likely be different to those being written.
90% WRITE
As far as the other 90% go, you wouldn't want to read from the secondary database (while writing to the primary) during some write logic as you will not want potentially stale data involved.
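The stale-read risk can be illustrated with a toy model of asynchronous replication. This is a plain-Java simulation with made-up names, not PostgreSQL behavior: two in-memory maps play the role of the primary and the secondary, and changes only reach the secondary when replay() is called.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

public class ReplicationLagSketch {

    private final Map<String, String> primary = new HashMap<>();
    private final Map<String, String> replica = new HashMap<>();
    // Changes committed on the primary but not yet applied on the replica.
    private final Queue<String[]> pendingChanges = new ArrayDeque<>();

    void writeToPrimary(String key, String value) {
        primary.put(key, value);
        pendingChanges.add(new String[] {key, value});
    }

    String readFromPrimary(String key) {
        return primary.get(key);
    }

    String readFromReplica(String key) {
        return replica.get(key);
    }

    // Simulates the replica catching up with the primary.
    void replay() {
        String[] change;
        while ((change = pendingChanges.poll()) != null) {
            replica.put(change[0], change[1]);
        }
    }

    public static void main(String[] args) {
        ReplicationLagSketch db = new ReplicationLagSketch();
        db.writeToPrimary("post:1", "High-Performance Java Persistence");

        System.out.println(db.readFromReplica("post:1")); // null (stale read)
        System.out.println(db.readFromPrimary("post:1")); // High-Performance Java Persistence

        db.replay();
        System.out.println(db.readFromReplica("post:1")); // High-Performance Java Persistence
    }
}
```

A write flow that reads its own data back from the secondary before the changes are replayed would act on the stale value.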
Some of these reads are likely to be looking up "static" data. If Hibernate's caching is not reducing database hits for reads, I would consider an in-memory cache like Memcached or Redis for this type of data. This same cache could be used by both the 10% read and the 90% write processes.
For reads that are not static (i.e. reading data you have recently written), Hibernate should hold data in its object cache if it's sized appropriately. Can you determine your cache hit/miss performance?
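As a rough illustration of the read-through idea for "static" data, here is a sketch in which a ConcurrentHashMap stands in for an external cache such as Memcached or Redis. It is not a client for either of them; the class name, the loader function and the hit counter are hypothetical and exist only to show that repeated lookups stop reaching the database.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class StaticDataCacheSketch<K, V> {

    private final ConcurrentMap<K, V> cache = new ConcurrentHashMap<>();
    // Hypothetical function that loads a value from the database on a cache miss.
    private final Function<K, V> databaseLoader;
    private final AtomicInteger databaseHits = new AtomicInteger();

    public StaticDataCacheSketch(Function<K, V> databaseLoader) {
        this.databaseLoader = databaseLoader;
    }

    // Returns the cached value, loading it from the database only on a miss.
    public V get(K key) {
        return cache.computeIfAbsent(key, k -> {
            databaseHits.incrementAndGet();
            return databaseLoader.apply(k);
        });
    }

    // Drops an entry, e.g. after the underlying row has been changed.
    public void evict(K key) {
        cache.remove(key);
    }

    public int databaseHits() {
        return databaseHits.get();
    }

    public static void main(String[] args) {
        StaticDataCacheSketch<String, String> countries =
            new StaticDataCacheSketch<>(code -> "Country " + code);

        countries.get("RO");
        countries.get("RO");
        System.out.println(countries.databaseHits()); // 1
    }
}
```

The hit counter is also a cheap way to start answering the hit/miss question before reaching for a dedicated caching product.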
QUARTZ
If you know for sure that a scheduled job won't impact the same set of data as another job, you could run them against different databases. However, if in doubt, always perform batch updates against one (primary) server and replicate the changes out. It is better to be logically correct than to introduce replication issues.
DB PARTITIONING
If your 1,000 requests per second are writing a lot of data, look at partitioning your database. You may find you have ever growing tables. Partitioning is one way to address this without archiving data.
Sometimes you need little or no change to your application code.
Archiving is obviously another option.
Disclaimer: Any question like this is always going to be application specific. Always try to keep your architecture as simple as possible.
Since the replication is async, the accepted solution will cause hard-to-debug and hard-to-reproduce bugs with the second-level cache. This is demonstrated here.
This automated test shows that this can lead to manipulating incomplete entity graphs.
The cleanest path is to have one EntityManagerFactory per DataSource.
If I understand correctly, 90% of the HTTP requests to your webapp involve at least one write and have to operate on the master database. You can direct read-only transactions to the copy database, but the improvement will only affect 10% of overall database operations, and even those read-only operations will still hit a database.
The common architecture here is to use a good database cache (Infinispan or Ehcache). If you can offer a big enough cache, you can hope that a good part of the database reads only hit the cache and become memory-only operations, whether or not they are part of a read-only transaction. Cache tuning is a delicate operation, but IMHO it is necessary to achieve a high performance gain. Those caches even allow for distributed front ends, even if the configuration is a bit harder in that case (you might have to look at Terracotta clusters if you want to use Ehcache).
Currently, database replication is mainly used to secure the data, and it is used as a concurrency improvement mechanism only if large parts of the information system only read data, which is not what you are describing.
You can also run ProxySQL in front of your DB nodes (which can be a Galera cluster setup) and set query read/write split rules, so the proxy distributes traffic according to the defined rules. For example, SELECT queries are routed to a read node, whereas UPDATE queries or read-write transactions go to the write node.
I think the question is general, so I'm not sure why the preferred answer steers it into Spring internals. Anyway, you may want to take a look at Apache ShardingSphere, which has this feature:
Read/write Splitting
---------------------
Read/write splitting can be used to cope with business access with high stress. ShardingSphere provides flexible read/write splitting capabilities and can achieve read access load balancing based on the understanding of SQL semantics and the ability to perceive the underlying database topology.
One thing I am concerned about is the "understanding of SQL semantics" claim, because how would any library "understand" whether a statement like select myfunct(1) from dual changes data inside the function or not?
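The concern can be made concrete with a deliberately naive router. This is a hypothetical sketch, not how ProxySQL or ShardingSphere classify statements: it routes purely on the statement text, so it has no way of knowing what myfunct does internally.

```java
import java.util.Locale;

public class NaiveReadWriteSplitSketch {

    enum Target { WRITE_NODE, READ_NODE }

    // Routes on the statement text alone: plain SELECTs go to the read node.
    static Target route(String sql) {
        String normalized = sql.trim().toLowerCase(Locale.ROOT);
        boolean looksReadOnly = normalized.startsWith("select")
            && !normalized.contains("for update");
        return looksReadOnly ? Target.READ_NODE : Target.WRITE_NODE;
    }

    public static void main(String[] args) {
        System.out.println(route("SELECT * FROM post"));          // READ_NODE
        System.out.println(route("UPDATE post SET title = 'x'")); // WRITE_NODE
        // Routed to the read node even if myfunct writes data internally.
        System.out.println(route("select myfunct(1) from dual")); // READ_NODE
    }
}
```

With a text-based rule like this one, a SELECT that calls a data-modifying function ends up on a read-only node, where it would presumably fail or misbehave, which is why explicit per-transaction routing is easier to reason about.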