Splitting MySQL Database into separate databases

The MySQL database used by my application is growing very quickly, and I am currently in no position to migrate to a NoSQL database.
I have identified the following areas where I could try splitting the current database into multiple databases:
There are some tables with static content, i.e. content that rarely changes.
There are user tables that store user data generated by interaction, which changes constantly.
Now, if I split the database into two different databases, how will I handle transactions? How should I write the data access layer: will I have connections to both databases? The application currently uses Spring and Hibernate for the back end. There are calls which join the user tables and the content tables in the current schema.
The architecture has the following structure:
Controller -> Service -> DAO Layer.
So, if I am willing to refactor the DAO layer that communicates with the database, what approach should I follow? I only know the Hibernate ORM, but I would be willing to let it go if there is something better.

Multiple databases on the same server? That approach will probably not improve performance on its own. RAM, fast disks, optimization, partitioning, and correct indexing will have a far greater payback.
If you have multiple databases on one server you can connect to them with a single connection, and simply use the database names with the table names in your SQL. Transactions work fine within a single connection.
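As a rough illustration of the database-qualified table names described above, here is a minimal Java sketch that only builds the SQL text. The database names (user_db, content_db), the tables, and the columns are all hypothetical; in a real application the resulting string would be handed to whatever single connection or Hibernate native query you already use.

```java
class CrossDatabaseQuery {
    // Quote an identifier MySQL-style with backticks, doubling any embedded backtick.
    static String quote(String identifier) {
        return "`" + identifier.replace("`", "``") + "`";
    }

    // Build a database-qualified table name such as `user_db`.`user`.
    static String qualified(String database, String table) {
        return quote(database) + "." + quote(table);
    }

    // A join across two databases on the same server, issued over one connection.
    // All database, table, and column names here are made up for illustration.
    static String scoreReportSql() {
        return "SELECT u.name, a.title, s.score"
             + " FROM " + qualified("user_db", "user") + " u"
             + " JOIN " + qualified("user_db", "user_score") + " s ON s.user_id = u.user_id"
             + " JOIN " + qualified("content_db", "article") + " a ON a.article_id = s.article_id"
             + " WHERE s.score > ?";
    }

    public static void main(String[] args) {
        System.out.println(scoreReportSql());
    }
}
```

Because both databases are reached through the same connection, a statement like this can run inside an ordinary transaction.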
Transactions across multiple connections and multiple servers are harder. There's a feature in MySQL called XA transactions to help handle this. But it has plenty of overhead, and is therefore most useful for high-value transactions as in banking.
In the jargon of the trade, adding servers is called "scale-out." The alternative is "scale-up," in which you add more RAM, faster direct-access storage, optimization, and other stuff to a single server to get it to do more.
There are several approaches you can take to the scale-out problem. The classic one is to use MySQL replication to set up a single primary server with multiple load-balanced replica servers. That's probably the path that's most often taken, so you can do it without reinventing a lot of wheels. In this solution you do all your writing to a single instance. Queries that look up data can use multiple read-only load-balanced instances.
http://dev.mysql.com/doc/refman/5.5/en/replication-solutions-scaleout.html
This is a very popular approach where you have a mix of long-running reporting queries and short-running interactive queries. The reporting can be run on dedicated replica servers.
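The write-to-primary, read-from-replicas split could look roughly like this in a DAO layer. This is only a sketch: the class name and the plain strings standing in for data sources are assumptions, and a real setup would hold javax.sql.DataSource objects or delegate the balancing to a proxy.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class ReadWriteRouter {
    private final String primary;
    private final List<String> replicas;
    private final AtomicInteger next = new AtomicInteger();

    ReadWriteRouter(String primary, List<String> replicas) {
        this.primary = primary;
        this.replicas = List.copyOf(replicas);
    }

    // Every write goes to the single primary instance.
    String forWrite() {
        return primary;
    }

    // Reads rotate over the replicas; with no replicas configured they fall back to the primary.
    String forRead() {
        if (replicas.isEmpty()) {
            return primary;
        }
        int index = Math.floorMod(next.getAndIncrement(), replicas.size());
        return replicas.get(index);
    }

    public static void main(String[] args) {
        // The host names are placeholders, not real servers.
        ReadWriteRouter router = new ReadWriteRouter("primary", List.of("replica-1", "replica-2"));
        System.out.println("write -> " + router.forWrite());
        System.out.println("read  -> " + router.forRead());
        System.out.println("read  -> " + router.forRead());
    }
}
```

Since replication is normally asynchronous, a read that must see a write made a moment earlier may need to go to the primary as well.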
Another approach is multiple-primary-server replication using MySQL Cluster. https://dev.mysql.com/doc/refman/5.5/en/mysql-cluster-replication-multi-master.html
Another approach, if you have money to spend, is to go with a supported MySQL Cluster. Oracle, MariaDB, and Percona have such products on offer.
Scale-out is a big job no matter how you approach it. There's some documented experience from other people who have done it. For example, https://www.facebook.com/note.php?note_id=23844338919

It sounds like you have not thought about how to partition your database.
You should read up on database normalization first.
To split the database, I would export the SQL from the database, then create two new files and copy into each the tables I want in the corresponding database. After that I would import the two files into their respective databases.
I think this might help you help me: let's say I want to print reports for a user. The user is persisted in the 'user' table, and there is a score table which holds the score for every user_id. Now, my plan is to put the user table in one database and the score table in another, making them two data sources. How can I handle such a scenario?
First, putting the tables in different databases makes no sense to me, and I do not know whether it is possible to write SELECT queries that mix two different databases.
Example: SELECT score, name FROM user, score WHERE score > 100 AND (score.user_id = user.user_id);
I don't know if this works with two databases; I think not.
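If the user and score tables really do end up in two separate data sources, one option is to do the join in the application. The sketch below assumes each side has already been loaded with its own query; the record types and the in-memory inputs are hypothetical stand-ins for those two result sets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class AppSideJoin {
    // One row from the score table (second data source).
    record ScoreRow(long userId, int score) {}

    // One line of the report produced by the join.
    record ReportLine(String name, int score) {}

    // Equivalent of: SELECT name, score ... WHERE score > minScore AND score.user_id = user.user_id
    static List<ReportLine> join(Map<Long, String> namesByUserId, List<ScoreRow> scores, int minScore) {
        List<ReportLine> report = new ArrayList<>();
        for (ScoreRow row : scores) {
            String name = namesByUserId.get(row.userId());
            // Rows without a matching user are dropped, as in an inner join.
            if (name != null && row.score() > minScore) {
                report.add(new ReportLine(name, row.score()));
            }
        }
        return report;
    }

    public static void main(String[] args) {
        Map<Long, String> users = Map.of(1L, "ann", 2L, "bob");
        List<ScoreRow> scores = List.of(new ScoreRow(1, 150), new ScoreRow(2, 90), new ScoreRow(3, 200));
        System.out.println(join(users, scores, 100));
    }
}
```

The cost is that filtering and matching now happen in Java, so each side should be narrowed by its own WHERE clause before it is loaded.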

Related

How to convert an existing relational database model into a model suitable for a NoSQL database (like MongoDB or Amazon DynamoDB)

I want to modify an existing Java shopping cart app to make it work with a NoSQL database like Amazon DynamoDB or MongoDB. But the existing MySQL database is relational: it has composite keys and primary/foreign keys. In contrast, in Amazon DynamoDB there is either a single primary key or a composite primary key made up of two fields.
I have the detailed data model of the relational database. How do I go about converting it so that I have a database in Amazon DynamoDB that the app can work with? Are there any best practices or precautions to keep in mind when doing this? Will this involve a lot of work rewriting the application code as well, or can I handle all the changes at the database level without modifying the app's logic?
Also, is there any tool that does most of this work?

There is no automated way to do this. NoSQL databases like MongoDB do not map data structures in the same way as MySQL. They have different performance characteristics and offer different ways of storing data. In some cases you'd coalesce two SQL tables into one collection, where you simply include the joined data in the same document.
How and when you'd do that depends on how you would logically group the data, but just as much on the sort of workload you're putting on your data. For example, for heavy reads and few writes, you might store the data differently than for heavy writes and few reads.
Besides having to redo the interface from your application to the database, you will also have to re-architect your data model. That's going to be as much work as designing your SQL structure, and it works best if you don't think about how you would do it in SQL. NoSQL and SQL are two totally different beasts, and they need to be treated as such!
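To make the "coalesce two tables into one document" idea concrete, here is a small Java sketch that folds rows from a parent table and a child table into one nested structure. The table and column names (order_id, customer, sku, quantity) are invented for a shopping cart example, and plain maps stand in for whatever document type the chosen NoSQL driver expects.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class OrderDocumentBuilder {
    // Embed the matching order_item rows inside the order, instead of joining at read time.
    static Map<String, Object> toDocument(Map<String, Object> orderRow, List<Map<String, Object>> itemRows) {
        Map<String, Object> document = new LinkedHashMap<>();
        document.put("_id", orderRow.get("order_id"));
        document.put("customer", orderRow.get("customer"));

        List<Map<String, Object>> items = new ArrayList<>();
        for (Map<String, Object> item : itemRows) {
            // The foreign key is only used for matching; it is not stored in the embedded item.
            if (orderRow.get("order_id").equals(item.get("order_id"))) {
                Map<String, Object> embedded = new LinkedHashMap<>();
                embedded.put("sku", item.get("sku"));
                embedded.put("quantity", item.get("quantity"));
                items.add(embedded);
            }
        }
        document.put("items", items);
        return document;
    }

    public static void main(String[] args) {
        Map<String, Object> order = Map.of("order_id", 1, "customer", "ann");
        Map<String, Object> first = Map.of("order_id", 1, "sku", "A-100", "quantity", 2);
        Map<String, Object> other = Map.of("order_id", 2, "sku", "B-200", "quantity", 1);
        System.out.println(toDocument(order, List.of(first, other)));
    }
}
```

Whether embedding like this is the right shape depends on the read/write mix discussed above; a write-heavy item list might be better kept in its own collection.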

Here is a start: http://mongify.com/ It's not a "fully automatic" solution, but it looks like it could be a useful tool, at least as an 'outline' for reverse engineering a SQL app to work as a MongoDB app.

Justification of the need for an in-memory database

My use case is as follows --
I have a database table with 1000+ entries. The table is currently updated/edited infrequently, but I expect this to change in the future. Some of the columns in the table contain strings of considerable length.
Now I am in the process of writing a UI application that will have some mouseover events that will display texts derived from the aforementioned database table.
I have, for my use case, decided to write a backend 'server' that will host an in-memory database that will have all the data that was present in the aforementioned table. The UI app will now, on startup, cache the required data from the in-memory database present or hosted by the backend server.
Does my use case justify using an in-memory database? If not, what alternatives should I consider?
EDIT 1 --
My use case also involves running multiple searches of varying complexity on the database very frequently.
Thanks
p1ng

Seems like an excellent use-case for an in-memory database. Writing it yourself, on the other hand, is probably not the way to go.
There are plenty of existing options for just about any imaginable scenario: http://en.wikipedia.org/wiki/In-memory_database
If you're doing complex searches on text data, Lucene is quite excellent. It has special in-memory storage backends, but really, it doesn't matter for such a tiny dataset - it will always be quickly cached anyway.
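For a table of roughly a thousand rows, one lightweight alternative worth weighing against a full in-memory database is an immutable map that the backend swaps out whenever the table changes. The sketch below is only that: the class name and the id-to-text shape are assumptions, and its substring search is far simpler than what Lucene offers.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

class TooltipCache {
    // Replaced as a whole on reload, so readers never observe a half-updated map.
    private volatile Map<Integer, String> textById = Map.of();

    // Called after loading the rows from the real database table.
    void reload(Map<Integer, String> rows) {
        textById = Map.copyOf(rows);
    }

    // Returns null when the id is unknown.
    String lookup(int id) {
        return textById.get(id);
    }

    // Case-insensitive substring search; returns matching ids in ascending order.
    List<Integer> search(String needle) {
        String lowered = needle.toLowerCase(Locale.ROOT);
        return textById.entrySet().stream()
                .filter(entry -> entry.getValue().toLowerCase(Locale.ROOT).contains(lowered))
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        TooltipCache cache = new TooltipCache();
        cache.reload(Map.of(1, "Opens the file dialog", 2, "Saves the current file", 3, "Quits"));
        System.out.println(cache.lookup(3));
        System.out.println(cache.search("FILE"));
    }
}
```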

Designing an account statement

I have a MySQL InnoDB database with a couple of tables that store different kinds of transactions for a user. In order to show a custom 'Account Statement', I have to fetch data from all of these tables every time a user wishes to see the Account Statement.
I am not sure what would be an optimized approach.
There are a lot of users (and the data keeps changing in real time) and I'm not sure if I should keep caching the sql queries.
Should I create views that combine the table and keep updating it whenever there is an update to the parent table?
Should I perform a join on these multiple tables each time a user requests the account statement?
I was not able to find out if there is a standard design/practice for showing account statement (with pagination). Any suggestions?
Thank you.

I would recommend starting by creating a JPA mapping of your tables and then using a "standard" provider (e.g. Hibernate) to access your data. This gives you transparent access from Java to your data without having to think (too much) about views, etc.
Your scenario seems very common and is exactly what an RDBMS is for. Do not worry about performance now, when you are just starting your first project (if it is not your first project, this question makes no sense).
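On the pagination part of the question, the basic shape is: combine the entries from the different transaction tables, order them by time, and return one slice. The Java sketch below shows that logic on in-memory lists; the Entry fields and the source names are hypothetical, and with real data volumes the same merge, ordering, and limit would normally be pushed into the database query rather than done in Java.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class AccountStatement {
    // One statement line; "source" names the table the entry came from.
    record Entry(Instant at, String source, long amountCents) {}

    // Merge entries from all tables, newest first, and return the requested zero-based page.
    static List<Entry> page(List<List<Entry>> perTableEntries, int pageNumber, int pageSize) {
        List<Entry> all = new ArrayList<>();
        for (List<Entry> entries : perTableEntries) {
            all.addAll(entries);
        }
        all.sort(Comparator.comparing(Entry::at).reversed());

        int from = Math.min(pageNumber * pageSize, all.size());
        int to = Math.min(from + pageSize, all.size());
        return new ArrayList<>(all.subList(from, to));
    }

    public static void main(String[] args) {
        List<Entry> deposits = List.of(new Entry(Instant.ofEpochSecond(100), "deposit", 5_000));
        List<Entry> purchases = List.of(
                new Entry(Instant.ofEpochSecond(200), "purchase", -1_250),
                new Entry(Instant.ofEpochSecond(50), "purchase", -300));
        System.out.println(page(List.of(deposits, purchases), 0, 2));
    }
}
```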

Is it a good idea to process a large amount of data directly in the database?

I have a database with a lot of web pages stored.
I will need to process all the data I have, so I have two options: retrieve the data into the program, or process it directly in the database with some functions I will create.
What I want to know is:
Is doing some of the processing in the database, and not in the application, a good idea?
When is this recommended and when not?
Are there pros and cons?
Is it possible to extend the language with new features (external APIs/libraries)?
I tried retrieving the content into the application (it worked), but it was too slow and messy. My concern was that I can't do in the database what I can do in Java, but I don't know if this is true.
ONLY an example: I have a table called Token. At the moment, it has 180,000 rows, but this will increase to over 10 million rows. I need to do some processing to know if a word between two tokens classified as 'Proper Name' is part of a name or not.
I will need to process all the data. In this case, is doing it directly in the database better than retrieving it into the application?

"My concern was that I can't do in the database what I can do in Java, but I don't know if this is true."
No, that is not a correct assumption. There are valid circumstances for using the database to process data. For example, if the work involves calling a lot of disparate SQL statements that can be combined in a stored procedure, then you should do the processing in the stored procedure and call it from your Java application. This way you avoid making several network trips to the database server.
I do not know what you are processing, though. Are you parsing XML data stored in your database? Then perhaps you should use XQuery; many modern databases support it.
"ONLY an example: I have a table called Token. At the moment, it has 180,000 rows, but this will increase to over 10 million rows. I need to do some processing to know if a word between two tokens classified as 'Proper Name' is part of a name or not."
Is there some indicator in the data that tells you it's a proper name? Fetching 10 million rows (which is highly likely to cause an OutOfMemoryError) and then going through them is not a good idea. If certain properties of the data can be put in a WHERE clause to limit the number of rows being fetched, that is the way to go in my opinion. You will of course need to run EXPLAIN on your SQL and check that the correct indexes are in place, the index cluster ratio, and the type of index; all of that will make a difference. Now, if you can't fully eliminate all "improper names", then you should try to get rid of as many as you can with SQL and then process the rest in your application. I am assuming this is a batch application, right? If it is a web application, then you definitely want to create a batch application to stage the data before the web application queries it.
I hope my explanation makes sense. Please let me know if you have questions.
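For the part that does stay in the application, the rows can be streamed rather than loaded all at once. The Java sketch below walks an iterator while holding only three tokens in memory and collects the words that sit between two tokens tagged as proper names. The Token record, the "ProperName" tag value, and the in-memory list in main are assumptions for illustration; in a real batch job the iterator would wrap a result set read in order, and deciding whether each candidate really belongs to the name is left to your own rules.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class ProperNameCandidates {
    // A word plus its classification; the tag values are hypothetical.
    record Token(String word, String tag) {}

    static boolean isProper(Token token) {
        return "ProperName".equals(token.tag());
    }

    // Single pass with a three-token sliding window: previous-previous, previous, current.
    static List<String> between(Iterator<Token> tokens) {
        List<String> candidates = new ArrayList<>();
        Token beforePrevious = null;
        Token previous = null;
        while (tokens.hasNext()) {
            Token current = tokens.next();
            // "previous" is a candidate when it is not a proper name but both neighbours are.
            if (beforePrevious != null && isProper(beforePrevious)
                    && !isProper(previous) && isProper(current)) {
                candidates.add(previous.word());
            }
            beforePrevious = previous;
            previous = current;
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<Token> sample = List.of(
                new Token("Pedro", "ProperName"),
                new Token("de", "Other"),
                new Token("Alcantara", "ProperName"),
                new Token("visited", "Other"),
                new Token("the", "Other"),
                new Token("city", "Other"));
        System.out.println(between(sample.iterator()));
    }
}
```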

Interacting directly with the DB for every single thing is tedious and hurts performance. There are several ways around this: you can use indexing, caching, or tools such as Hibernate, which can cache data in memory so that you don't need to query the DB for every operation. Indexing tools such as Lucene are also very popular and could solve your problem of hitting the DB every time.

Strategies for designing a database (accessed through Hibernate) that will hold a lot of archival data

I am developing an application that will be integrated with thousands of sensors, each sending information at 15-minute intervals. Let's assume that the format of the data is the same for all sensors. What is the best strategy for storing this data so that everything is archived (and remains accessible) without the growing data size having a negative impact?
The question is about general database design, I suppose, but I would like to mention that I am using Hibernate (with Spring Roo), so perhaps there is already something out there that addresses it.
Edit: the sensors are dumb and off the shelf. It is not possible to extend them. In the case of a network outage all information is lost. Since the sensors work over GPRS, this scenario is somewhat unlikely (the GPRS provider is a rather good one here in Sweden, but yes, it can go down and there is nothing one can do about it).
A queuing mechanism was the first thing I considered, and Spring Roo provides easy-to-use prototype code based on ActiveMQ.

I'd have a couple of concerns about this design:
Hibernate is an ORM tool. It demands an object model on one side and a relational one on the other. Do you have an object representation? If not, I'd say that Hibernate isn't the way to go. If it's a simple table mapping mechanism you'll be fine.
Your situation sounds like war: long periods of boredom surrounded by instants of sheer terror. I don't know if your design uses asynchronous mechanisms between the receipt of the sensor data and the back end, but I'd want to have some kind of persistent queuing mechanism to guarantee delivery of all the data and an orderly line while they were waiting to be persisted. As long as you don't need to access the data in real time, a queue will guarantee delivery and make sure you don't have thousands of requests showing up at a bottleneck at the same time.
How are you time stamping the sensor items as they come in? You might want to use a column that goes down to nanoseconds to get these right.
Are the sensors event-driven or timed?
Sounds like a great problem. Good luck.

Let's assume you have 10,000 sensors sending information every 15 minutes. To get better performance on the database side you may have to partition your database, possibly by date/time, sensor type, category, or some other factor. This also depends on how you will be querying your data.
http://en.wikipedia.org/wiki/Partition_(database)
Another bottleneck would be your Java/Java EE application itself. This depends on your requirements (for example, are all 10,000 sensors going to send information at the same time?) and on the architecture your Java application is going to follow. You will have to read articles on high scalability and performance.
Here is my recommendation for Java/Java EE solution.
Instead of a single application, have a cluster of applications receiving the data.
Have a controller application that decides which sensor sends data to which application instance in the cluster. An application instance may pull data from a sensor, or a sensor can push data to an application instance, but the controller is the one that determines which application instance is linked to which set of sensors. This controller must be dynamic, so that sensors can be added, removed, or updated, and application instances can join or leave the cluster at any time. Make sure that you build some failover capability into your controller.
So if you have 10,000 sensors and 10 application instances in the cluster, you have 1000 sensors linked to each instance at any given time. If you still want better performance, you can have, say, 20 application instances in the cluster, and you will have 500 sensors linked to each instance.
Application instances can be hosted on the same or on multiple machines, so that vertical as well as horizontal scalability is achieved. Each application instance will be multi-threaded and have local persistence. This avoids a bottleneck at the main database server and decreases your transaction response time. This local persistence can be SAN files, a local RDBMS (like Java DB), or even MQ. If you persist locally in a database, then you can use Hibernate for that.
Asynchronously move data from local persistence to the main database. How you do this depends on how you have persisted the data locally.
If you use file based persistence, you need a separate thread that reads data from file and inserts in main database repository.
If you use a local database then this thread can use Hibernate to read data locally and insert it on main database repository.
If you use MQ, you can have thread or separate application to move data from queue to main database repository.
Drawback to this solution is that there will be some lag between sensor having reported some data and that data appearing in main database.
Advantage in this solution is that it will give you high performance, scalability, and fail-over.
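The sensor-to-instance arithmetic above (10,000 sensors over 10 or 20 instances) can be sketched with a simple modulo assignment. This is a deliberately static illustration with made-up names: a controller that has to cope with instances joining and leaving would need something more dynamic, because changing the instance count here reassigns most sensors.

```java
class SensorAssignment {
    // Map a sensor id to an instance index in the range [0, instanceCount).
    static int instanceFor(int sensorId, int instanceCount) {
        return Math.floorMod(sensorId, instanceCount);
    }

    // Count how many sensors (ids 0..sensorCount-1) land on each instance.
    static int[] load(int sensorCount, int instanceCount) {
        int[] sensorsPerInstance = new int[instanceCount];
        for (int sensorId = 0; sensorId < sensorCount; sensorId++) {
            sensorsPerInstance[instanceFor(sensorId, instanceCount)]++;
        }
        return sensorsPerInstance;
    }

    public static void main(String[] args) {
        System.out.println("10 instances: " + load(10_000, 10)[0] + " sensors each");
        System.out.println("20 instances: " + load(10_000, 20)[0] + " sensors each");
    }
}
```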

With one reading per sensor every 15 minutes (96 per sensor per day), this means you are going to get about 1 record/second multiplied by how many thousand sensors you have, or roughly 2.9 million rows/month multiplied by how many thousand sensors you have.
Postgres has inheritance and partitioning. That would make it practical to have tables like:
sensordata_current
sensordata_2010_01
sensordata_2009_12
sensordata_2009_11
sensordata_2009_10
.
.
.
each table containing measurements for one month. Then a parent table sensordata can be created that "consists" of these child tables, meaning queries against sensordata would automatically go through the child tables, but only the ones which the planner deduces can contain data for that query. So if you say partitioned your data by months (which is a date range), and you expressed that wish with a date constraint on each child table, and you query by date range, then the planner - based on the child table constraints - will be able to exclude those child tables from execution of the query which do not contain rows satisfying the date range.
When a month is complete (say 2010 Jan just turned 2010 Feb), you would rename sensordata_current to the just completed month (2010_01), create a new sensordata_current, move over any rows from 2010_01 that have a timestamp in Feb into the newly created sensordata_current, and finally add a constraint to 2010_01 that expresses that it only has data for 2010 Jan. Also drop unneeded indices on 2010_01. In Postgres this can all be made atomic by enclosing it in a transaction.
Alternatively, you might need to leave _current alone, create a new 2010_01, and move all January rows into it from _current (then optionally vacuum _current to immediately reclaim the space, though if your rows are of constant size, with recent Postgres versions there is not much point in doing that). Your move (SELECT INTO / DELETE) will take longer in this case, but you won't have to write code to recreate indices, and this would also preserve other details (referential integrity, etc.).
With this setup removing old data is as quick and efficient as dropping child tables. And migrating away old data is efficient too since child tables are also accessible directly.
For more details see Postgres data partitioning.
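The first month-end rollover described above (rename, recreate, move stragglers, add the constraint) could be generated from Java along these lines. This is a sketch under assumptions: the timestamp column is called ts here, which the answer does not specify, and the new sensordata_current is created without indexes, so any index creation or dropping would have to be added for a real schema.

```java
import java.time.YearMonth;
import java.util.List;
import java.util.Locale;

class PartitionRollover {
    // Child table name for a month, e.g. sensordata_2010_01.
    static String tableFor(YearMonth month) {
        return String.format(Locale.ROOT, "sensordata_%04d_%02d", month.getYear(), month.getMonthValue());
    }

    // Statements to run in order, inside one transaction, when "completed" has just ended.
    static List<String> rolloverStatements(YearMonth completed) {
        String archived = tableFor(completed);
        String start = completed.atDay(1).toString();              // first day of the completed month
        String end = completed.plusMonths(1).atDay(1).toString();  // first day of the following month
        return List.of(
                "BEGIN",
                "ALTER TABLE sensordata_current RENAME TO " + archived,
                "CREATE TABLE sensordata_current () INHERITS (sensordata)",
                // Rows that already belong to the new month go back into the new current table.
                "INSERT INTO sensordata_current SELECT * FROM " + archived + " WHERE ts >= DATE '" + end + "'",
                "DELETE FROM " + archived + " WHERE ts >= DATE '" + end + "'",
                // The date constraint is what lets the planner skip this table for other months.
                "ALTER TABLE " + archived + " ADD CONSTRAINT " + archived + "_month"
                        + " CHECK (ts >= DATE '" + start + "' AND ts < DATE '" + end + "')",
                "COMMIT");
    }

    public static void main(String[] args) {
        rolloverStatements(YearMonth.of(2010, 1)).forEach(System.out::println);
    }
}
```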

Is it a requirement that these sensors connect directly to an application to upload their data? And this application is responsible for writing the data to the database?
I would consider having the sensors write data to a message queue instead, and having your "write to DB" application be responsible for picking up new data from the queue and writing it to the database. This seems like a pretty classic example of "message producers" and "message consumers", i.e. the sensors and the application, respectively.
This way, the sensors are not affected if your "write to DB" application has any downtime, or if it has any performance issues or slowdowns from the database, etc. This would also allow you to scale up the number of message consumers in the future without affecting the sensors at all.
It might seem like this type of solution simply moves the possible point of failure from your consumer application to a message queue, but there are several options for making the queue fault-tolerant: clustering, persistent message storage, etc.
Apache ActiveMQ is a popular message queue system in the Java world.
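The producer/consumer split can be sketched in plain Java as follows. Note the loud assumption: java.util.concurrent.BlockingQueue is used here only as an in-process stand-in for a message broker, so it has none of the persistence or clustering mentioned above, and the Reading fields and the list that plays the role of the database are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class SensorQueueSketch {
    // One sensor measurement; the field names are invented for illustration.
    record Reading(int sensorId, long epochSeconds, double value) {}

    // Sentinel object telling the consumer that no more readings will arrive.
    static final Reading STOP = new Reading(-1, 0, 0);

    // Consumer side: take readings off the queue and "persist" them until STOP is seen.
    static List<Reading> drainToStore(BlockingQueue<Reading> queue) throws InterruptedException {
        List<Reading> store = new ArrayList<>(); // stands in for the database writes
        while (true) {
            Reading reading = queue.take();      // blocks while the queue is empty
            if (reading == STOP) {
                return store;
            }
            store.add(reading);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Reading> queue = new LinkedBlockingQueue<>();
        // Producer side: sensors only enqueue and never talk to the database.
        Thread producer = new Thread(() -> {
            for (int sensorId = 1; sensorId <= 3; sensorId++) {
                queue.add(new Reading(sensorId, 1_262_304_000L, 20.0 + sensorId));
            }
            queue.add(STOP);
        });
        producer.start();
        List<Reading> stored = drainToStore(queue);
        producer.join();
        System.out.println(stored.size() + " readings persisted");
    }
}
```

With a real broker, the consumer could be stopped, restarted, or multiplied without the producers noticing, which is the decoupling the answer is after.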