Background
I am working on a multi-tenant web application that will need to support thousands of users. The app is being built on top of the Java-based Play! MVC Framework using JPA/Hibernate and PostgreSQL.
I watched Guy Naor's talk on Writing Multi-tenant Applications in Rails, in which he talks about a few approaches to multi-tenancy (data isolation decreases as you go down the list):
Each customer has a separate database
One database with separate schemas and tables (table namespaces) for each customer.
One database with one set of tables that have customer id columns.
I settled on approach #2, where a user id of some sort is parsed out of the request and then used to access that user's schema. A PostgreSQL SET search_path TO customer_schema,public command is issued before any query is made, to make sure the customer's tables are the target of the query. This is easily done with @Before controller annotations on controller methods in Play! (this is the approach Guy used in his Rails example). The search_path in PostgreSQL acts exactly like $PATH does in an OS; awesome!
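Conceptually, the per-request flow I had in mind looks something like this (a rough sketch in Play 1.x style; resolveTenantSchema is a made-up helper, and the schema name must be whitelisted because it is spliced into the SQL rather than bound as a parameter):

import play.db.jpa.JPA;
import play.mvc.Before;
import play.mvc.Controller;
import play.mvc.Http;

public class TenantAwareController extends Controller {
    @Before
    static void setSearchPath() {
        String schema = resolveTenantSchema(request);
        JPA.em().createNativeQuery("SET search_path TO " + schema + ", public")
                .executeUpdate();
    }

    // Made-up resolution: first subdomain label as schema name (validate/whitelist it!)
    static String resolveTenantSchema(Http.Request req) {
        return req.domain.split("\\.")[0];
    }
}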
All this sounded great, but I immediately ran into difficulties implementing it on top of a JDBC/Hibernate/JPA stack, because there doesn't seem to be a way to switch schemas dynamically at runtime.
The Problem
How do I get either JDBC or Hibernate to support dynamically switching postgres schemas at runtime?
It seems database connections are statically configured by a connection factory (see: How to manage many schemas on one database using hibernate). I have found similar questions with similar answers suggesting one SessionFactory per user, but since SessionFactory is a heavyweight object, it's implausible that you could support hundreds, let alone thousands, of users that way.
I haven't committed myself completely to approach #2 above, but I haven't abandoned it for approach #3 just yet either.
You can execute the command
SET search_path TO customer_schema,public
as often as you need to, within the same connection / session / transaction. It is just another command like SELECT 1;. More in the manual here.
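For example, over plain JDBC (a minimal sketch; since the schema name cannot be bound as a parameter, it must come from a trusted or whitelisted source):

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public final class SearchPath {
    // Switches the schema for everything that follows on this connection.
    public static void set(Connection conn, String schema) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute("SET search_path TO " + schema + ", public");
        }
    }
}

Unqualified table names in subsequent queries on the same connection then resolve against that schema first, then public.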
Of course, you can also preset the search_path per user.
ALTER ROLE foo SET search_path=foo, public;
If every user (or many of them) has a schema that matches their user name, you can simply go with the default setting in postgresql.conf:
search_path = '"$user", public'
More ways to set the search_path here:
How does the search_path influence identifier resolution and the "current schema"
As of Hibernate 4.0, multi-tenancy is natively supported at the discriminator (customerID), schema, and database level. See the source code here, and the unit test here.
The difficulty is that, while the unit test's file name is SchemaBasedMultitenancyTest, the actual MultitenancyStrategy used is Database. I can't find any examples on how to make it work based on schema, but maybe the unit test will be enough to go on...
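For what it's worth, the schema strategy seems to boil down to implementing MultiTenantConnectionProvider yourself and switching search_path on every connection you hand out. A hedged sketch (Hibernate 4 package names, which moved in later versions; the DataSource wiring and the tenant-id-equals-schema-name convention are my assumptions):

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import javax.sql.DataSource;
import org.hibernate.service.jdbc.connections.spi.MultiTenantConnectionProvider;

public class SchemaPerTenantConnectionProvider implements MultiTenantConnectionProvider {
    private final DataSource dataSource; // assumed to be configured elsewhere

    public SchemaPerTenantConnectionProvider(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    @Override
    public Connection getConnection(String tenantIdentifier) throws SQLException {
        Connection conn = dataSource.getConnection();
        try (Statement st = conn.createStatement()) {
            // tenantIdentifier doubles as the schema name here; validate it upstream
            st.execute("SET search_path TO " + tenantIdentifier + ", public");
        }
        return conn;
    }

    @Override
    public void releaseConnection(String tenantIdentifier, Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute("SET search_path TO public"); // reset before returning to the pool
        }
        conn.close();
    }

    @Override
    public Connection getAnyConnection() throws SQLException {
        return dataSource.getConnection();
    }

    @Override
    public void releaseAnyConnection(Connection conn) throws SQLException {
        conn.close();
    }

    @Override
    public boolean supportsAggressiveRelease() {
        return false;
    }

    @Override
    public boolean isUnwrappableAs(Class unwrapType) {
        return false;
    }

    @Override
    public <T> T unwrap(Class<T> unwrapType) {
        return null;
    }
}

You would then point hibernate.multiTenancy at SCHEMA, register the provider via hibernate.multi_tenant_connection_provider, and supply a CurrentTenantIdentifierResolver via hibernate.tenant_identifier_resolver.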
Sharding by schema is common, but see this post from the Apartment gem authors covering some of its drawbacks.
At Citus, we shard via option #3 listed above, and you can read more in our use-case guide in the Documentation.
Related
I would like to create a simple project using Spring to track the status of some customers across different environments. A customer can have two environments (dev and prod); others may have one, two, or three.
The basic idea is I would like to create a Web Service using spring with the following interface:
localhost:8080/customer1/environment1/status to extract status data from customer1 and environment1.
I have two options:
Using MongoDB, with a database per customer, a collection per environment, and the status documents inside. I found the following problems:
Many of the solutions I found on the web were for previous versions of Spring (I am using Spring 5)
Also, I am not sure how I can implement dynamic collections (I mean, if I make a request to localhost:8080/customer2/environment2/status, I would like to change not only the database but also the collection dynamically)
Using Postgres, with a schema per customer and a table per environment (all the tables will have the same structure)
The problem is that the table names can differ (production, development, test, and so on), so I would have to implement dynamic table names in Spring (which I am not sure is possible)
I have been searching for a couple of days for an easy solution to this (I initially thought it would be easy, but it looks like it is not)
What do you think would be the best and simplest solution: MongoDB or Postgres?
Can you provide the basic steps to reproduce it, or a GitHub repository with code I could use as a reference?
PS: There is no need to be extra safe, because it will be an internal service, so the location of the customers' data doesn't matter: it can be in the same database or in different databases
First of all, I think your database choice should depend more on the advantages and disadvantages one database gives you over the other. Second, I don't believe a database per user is a good idea; imagine what happens when you get 5000 users. It will be a pain to administer that many databases, or to keep switching databases in your code. I suggest you first try to fit a condensed database model of your requirements into a single database, and then, working from that, decide which database is better for you.
I hope it helps!
I am creating a webapp in Spring Boot (Spring + Hibernate + MySQL).
I have already created all the CRUD operations for the data of my app, and now I need to process the data and create reports.
Given the complexity of these reports, I will create some summary or pre-processed tables. This way, I can trigger report creation once and then fetch the reports efficiently.
My question is whether I should build the reports in Java or in stored procedures in MySQL.
Pros of doing it in Java:
More logging
More control over the structures (entities, maps, lists, etc.)
Catching exceptions
Flexibility if I change my DB engine (it probably won't happen, but you never know)
Cons of doing it in Java:
Maybe memory usage?
Any thoughts on this?
Thanks!
Java, though both are possible. It depends on what is most important, what skills are available for maintenance, and the cost of maintaining it. Stored procedures are usually very fast, but availability and performance also depend on the exact database you use. You will need specialized skills, and then everything is tied to that specific database.
Hibernate comes with a dialect written for every major database to get the best performance out of the persistence layer. It's not as fast as a stored procedure, but it comes pretty close. With Spring Data on top of that, most of the difficulty is gone. Maintenance will not cost that much, and people who know Spring Data are easier to find than specialists in any particular database.
You can still create various "difficult" queries easily with HQL, so no blocker there. But Hibernate comes with more possibilities: you can have your caching done by Ehcache, and with Hibernate Envers your auditing is done in no time. That's the nice thing about this framework: it's widely used, and many free-to-use Maven dependencies are there for the taking. And if in the future you want to change your database, you can do it by changing a few parameters in your application.properties file when using Spring Data.
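For illustration, the swap really is a handful of standard application.properties entries in Spring Boot (the values here are placeholders; this example moves from MySQL to PostgreSQL):

spring.datasource.url=jdbc:postgresql://localhost:5432/mydb
spring.datasource.username=app
spring.datasource.password=secret
spring.jpa.properties.hibernate.dialect=org.hibernate.dialect.PostgreSQLDialect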
You can play with some annotations and see what performs better. For example, there is the @Inheritance annotation, with which you can have some classes end up in the same table or split across several tables. There is also @MappedSuperclass, with which you can have one JpaObject holding the id that all your entities extend. If you want more JPA tricks, maybe check this post with my answer on how to use a superclass and a general repository.
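A minimal sketch of the @MappedSuperclass idea (the name JpaObject follows the answer's wording; the rest is standard JPA):

import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.MappedSuperclass;

@MappedSuperclass
public abstract class JpaObject {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    public Long getId() {
        return id;
    }
}

// Entities then inherit the id mapping instead of redeclaring it:
// @Entity public class Report extends JpaObject { ... }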
Given the complexity of these reports, I will create some summary or pre-processed tables. This way, I can trigger report creation once and then fetch the reports efficiently.
My first thought is: is this required? It seems like adding complexity to the application that perhaps isn't needed. Premature optimisation and all that. Try writing the reports in SQL and running an execution plan. If it's good enough, you have less code to maintain and no added batch jobs to administer. Consider load testing with e.g. JMeter or Gatling to see how it holds up under stress.
Consider using QueryDSL or jOOQ for reporting. Both provide a database abstraction layer and a fluent API for querying databases, which deliver the benefits listed in the "Pros of doing it in Java" section of the question and may be better suited to the problem. The blog post jOOQ vs. Hibernate: When to Choose Which is well worth a read.
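To give a feel for the style, here's a hedged jOOQ sketch of a summary query using the plain-SQL fluent API (the sales table and its columns are invented; with jOOQ's code generation you'd get type-safe references instead of field(...) strings):

import static org.jooq.impl.DSL.*;

import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.DriverManager;
import org.jooq.DSLContext;
import org.jooq.SQLDialect;

public class SalesSummary {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/reports", "user", "secret")) {
            DSLContext ctx = using(conn, SQLDialect.MYSQL);
            // Total amount per customer, the kind of pre-aggregation a report needs
            ctx.select(field("customer_id"), sum(field("amount", BigDecimal.class)))
               .from(table("sales"))
               .groupBy(field("customer_id"))
               .fetch()
               .forEach(System.out::println);
        }
    }
}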
I am looking around for a multitenancy solution for my web application.
I would like to implement an application with the separate-schema model. I am thinking of having a datasource per session. To do that, I put the datasource and EntityManager in session scope, but that's not working. I am also considering loading the data-access-context.xml file (which includes the datasource and other repository beans) when the user enters their username, password, and tenant id. I would like to know if this is a good solution.
Multitenancy is a tricky subject, and it has to be handled on the JPA provider side so that, from the client code's perspective, nothing or almost nothing changes. EclipseLink has support for multitenancy (see: EclipseLink/Development/Indigo/Multi-Tenancy); Hibernate added it only recently.
Another approach is to use AbstractRoutingDataSource; see: Multi tenancy in Hibernate.
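A hedged sketch of that route (TenantContext is a hypothetical holder, not a Spring class; the per-tenant DataSources are registered through setTargetDataSources, and a filter or interceptor sets the tenant per request):

import org.springframework.jdbc.datasource.lookup.AbstractRoutingDataSource;

public class TenantRoutingDataSource extends AbstractRoutingDataSource {
    @Override
    protected Object determineCurrentLookupKey() {
        // Called by Spring on every getConnection(); the returned key picks
        // one of the DataSources registered via setTargetDataSources().
        return TenantContext.getCurrentTenant();
    }
}

// Hypothetical per-request tenant holder.
class TenantContext {
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();
    static void setCurrentTenant(String tenantId) { CURRENT.set(tenantId); }
    static String getCurrentTenant() { return CURRENT.get(); }
    static void clear() { CURRENT.remove(); }
}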
Using session scope is way too risky (you will also end up with thousands of database connections, a few for every session/user). Finally, EntityManager and the underlying database connections are not serializable, so you cannot migrate your session and scale your app properly.
I have worked with a number of multi-tenancy systems. The challenge here is how you
keep an open architecture, and
provide a solution that evolves with your business.
Let's look at the second challenge first. Multi-tenancy systems have a tendency to evolve to where you'll need to support use cases in which the same data (record) can be accessed by multiple tenants in different capacities (e.g. https://bugs.eclipse.org/bugs/show_bug.cgi?id=355458). So the system ultimately needs an Access Control List.
To keep the architecture open, you can code to a standard (like JPA). Coding directly to EclipseLink or Hibernate makes me uncomfortable.
Spring Security ACL provides a very flexible, community-supported solution to both of these challenges. Give that a try; I did and have been happy with its performance. However, I must caution you: it took me some digging to get my head around it.
I've got an Oracle database that has two schemas in it which are identical. One is essentially the "on" schema, and the other is the "off" schema. We update data in the off schema and then switch the schemas behind an alias which our production servers use. Not a great solution, but it's what I've been given to work with.
My problem is that a separate application (also handed to me) will now be streaming data to the database, and it currently updates only through the alias, which means it only ever writes to the "on" schema. That means that when the schemas get switched, all the data from this application vanishes from production (the schema it wrote to is now the "off" schema).
This application is using Hibernate 3.3.2 to update the database. There's Spring 3.0.6 in the mix as well, but not for the database updates. Finally, we're running on Java 1.6.
Can anyone point me toward a way of updating both the "on" and "off" schemas simultaneously that does not involve rewriting the whole DAO layer with Spring JDBC to manage two separate connection pools? I have not been able to find anything about getting Hibernate to do this. Thanks in advance!
You shouldn't be updating two separate databases this way, especially from the application's point of view. All it should know or care about is whether or not the data is there, not having to mess with two separate databases.
Frankly, this sounds like you may need to purchase an ETL tool. Even if you can't get it to update the 'on' schema from the 'off' one (fast enough to be practical), you will likely be able to use it to keep the two in sync (mirror changes from 'on' to 'off').
HA-JDBC is a replicating JDBC driver we investigated for a short while. It will automatically replicate all inserts and updates, and distribute all selects. There are other database-specific master-slave solutions as well.
On the other hand, I wouldn't recommend doing this for 4-8 hour procedures. Better to lock the database first, update one schema, then backup-and-restore a copy, and then unlock again.
I'm introducing a DAO layer into our application, which currently runs on SQL Server, because I need to port it to Oracle.
I'd like to use Hibernate and write a factory (or use dependency injection) to pick the correct DAOs according to the deployment configuration. What are the best practices in this case? Should I have two packages with different hibernate.cfg.xml and *.hbm.xml files and pick them accordingly in my factory? Is there any chance that my DAOs will work correctly with both DBMS without (too much) hassle?
Assuming that the table names and columns are the same between the two, you should be able to use the same hbm.xml files. However, you will certainly need to supply a different Hibernate configuration (hibernate.cfg.xml), as you will need to change Hibernate's dialect from SQL Server to Oracle.
If there are slight name differences between the two, then I would create two sets of mapping files - one per database server - package these up into separate JARs (such as yourproject-sqlserver-mappings.jar and yourproject-oracle-mappings.jar), and deploy the application with one JAR or the other depending on the environment.
I did this for a client a while back: at deployment, depending on a property set in a production.properties file, I changed the hibernate.dialect in the cfg file using Ant (you can use any XML transformer). However, this only works if the Hibernate code is seamless between both DBs, i.e. no DB-specific function calls, etc. HQL/JPAQL has standard function calls that help in this regard, like UPPER(s), LENGTH(s), etc.
If the DB implementations must necessarily be different, then you'd have to do something like what @matt suggested.
I've worked on an app that supports a lot of databases (Oracle, Informix, SQL Server, MySQL). We have one configuration file and one set of mappings. We use jndi for the database connection so we don't have to deal with different connection URLs in the app. When we initialize the SessionFactory we have a method that deduces the type of database from the underlying connection. For example, manually get a connection via JNDI and then use connection.getMetaData().getDatabaseProductName() to find out what the database is. You could also use a container environment variable to explicitly set it. Then set the dialect using configuration.setProperty(Environment.DIALECT, deducedDialect) and initialize the SessionFactory as normal.
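A hedged sketch of that deduction step (the JNDI name is illustrative; the dialect class names are standard Hibernate ones):

import java.sql.Connection;
import javax.naming.InitialContext;
import javax.sql.DataSource;
import org.hibernate.SessionFactory;
import org.hibernate.cfg.Configuration;
import org.hibernate.cfg.Environment;

public class DialectAwareSessionFactoryBuilder {
    public static SessionFactory build() throws Exception {
        DataSource ds = (DataSource) new InitialContext().lookup("java:comp/env/jdbc/appDS");
        String dialect;
        try (Connection conn = ds.getConnection()) {
            String product = conn.getMetaData().getDatabaseProductName();
            if (product.contains("Oracle")) {
                dialect = "org.hibernate.dialect.Oracle10gDialect";
            } else if (product.contains("Microsoft")) { // "Microsoft SQL Server"
                dialect = "org.hibernate.dialect.SQLServerDialect";
            } else if (product.contains("MySQL")) {
                dialect = "org.hibernate.dialect.MySQLDialect";
            } else {
                throw new IllegalStateException("Unsupported database: " + product);
            }
        }
        Configuration cfg = new Configuration().configure(); // reads hibernate.cfg.xml
        cfg.setProperty(Environment.DIALECT, dialect);
        return cfg.buildSessionFactory();
    }
}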
Some things you have to deal with:
Primary key generation. We use a customized version of the TableGenerator strategy so we have one key table with columns for table name and next key. This way every database can use the same strategy rather than sequence in Oracle, native for SQL Server, etc.
Functions specific to databases. We avoid them when possible. Hibernate dialects handle the most common ones. Occasionally we'll have to add our own to our custom dialect classes, e.g. date arithmetic is pretty non-standard, so we'll just make up a function name and map it to each database's way of doing it (see the sketch after this list).
Schema generation - we use the Hibernate schema generation class - it works with the dialects to create the correct DDL for each type of database and forces the database to match the mappings. You have to be aware of the keywords for each database, e.g. don't try to have a USER table in Oracle (USERS will work), or a TRANSLATION table in MySQL.
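As an example of the date-arithmetic point above, a custom dialect can register a made-up function name per database (a hedged Hibernate 3.x/4.x-style sketch; add_days is an invented HQL-level name):

import org.hibernate.dialect.Oracle10gDialect;
import org.hibernate.dialect.function.SQLFunctionTemplate;
import org.hibernate.type.StandardBasicTypes;

public class CustomOracleDialect extends Oracle10gDialect {
    public CustomOracleDialect() {
        // Oracle does date arithmetic with plain addition: date + days
        registerFunction("add_days",
                new SQLFunctionTemplate(StandardBasicTypes.DATE, "(?1 + ?2)"));
    }
}

// A MySQL sibling dialect would map the same HQL name to its own syntax:
// registerFunction("add_days",
//         new SQLFunctionTemplate(StandardBasicTypes.DATE,
//                 "date_add(?1, interval ?2 day)"));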
There is a table mapping the differences between Oracle and SQL Server here: http://psoug.org/reference/sqlserver.html
In my opinion the biggest pitfalls are:
1) Dates. The functions and mechanics are completely different. You will have to use different code for each DB.
2) Key generation - Oracle and SQL Server use different mechanics, and if you try to avoid "native" generation altogether by having your own keys table - well, you've just completely serialized all your inserts. Not good for performance.
3) Concurrency/locking is a bit different. Parts of the code that are performance-sensitive will probably be different for each DB.
4) Oracle is case sensitive, SQL Server is not. You need to be careful with that.
There are lots more :)
Writing SQL code that will run on two DBs is challenging. Making it fast can seem nearly impossible at times.