Can anybody give me an intuition for the following situation?
We access our Microsoft SQL Server database through EclipseLink. Our application contains an import functionality that generates a lot of insertions into a table (say table A). On the other hand, when we delete one of these entries, we do not actually delete it but set a "deleted" flag to true. Among different tables (say tables X, Y, Z) there are relations, so that if we delete an element from X, we also delete some elements of A. Importantly, for now we insert and delete every element separately. Given a lot of imports and a lot of such bulk updates (our deletions), our table A grows a lot; currently it holds about 500,000 rows.
Using the EclipseLink PerformanceProfiler, I found out that update times slow down, but there is no visible slow-down for insertions. Since we have an index on the composite primary key of table A, I expect that an insertion needs O(1) time for the index insertion (assuming the index behaves like a HashMap) and roughly O(1) to insert into the table. For an update, I would likewise expect O(1) for the retrieval and O(1) for the write.
In reality we get an insertion time of 0.6 ms per element, but an update time of 800 ms (!). Does anybody have an explanation for this? Also, if anybody knows good measures to improve this situation, I am happy to hear those as well.
So far the only idea I have is to bulk-update together all elements that meet a certain condition, such as: update all elements of table A that are related to element x of table X. But since we have a lot more hierarchy over multiple tables, I am not sure how much benefit this would bring.
Related
I have heard a lot about denormalization, which is used to improve the performance of certain applications. But I have never tried anything related to it.
So I am just curious: which places in a normalized DB make performance worse, or in other words, what are the principles of denormalization?
How can I use this technique if I need to improve performance?
Denormalization is generally used to either:
Avoid a certain number of queries
Remove some joins
The basic idea of denormalization is that you add redundant data, or group some together, so that you can retrieve that data more easily, at a smaller cost; this is better for performance.
A quick example:
Consider a "Posts" and a "Comments" table, for a blog
For each post, you'll have several rows in the Comments table
This means that to display a list of posts with the associated number of comments, you'll have to:
Do one query to list the posts
Do one query per post to count how many comments it has (Yes, those can be merged into only one, to get the number for all posts at once)
Which means several queries.
Now, if you add a "number of comments" field into the Posts table:
You only need one query to list the posts
And no need to query the Comments table: the number of comments is already denormalized into the Posts table.
And a single query that returns one extra field is better than several queries.
Now, there are some costs, yes:
First, this takes some space, both on disk and in memory, as you have some redundant information:
The number of comments is stored in the Posts table
And that same number can also be derived by counting rows in the Comments table
Second, each time someone adds/removes a comment, you have to:
Save/delete the comment, of course
But also, update the corresponding number in the Posts table.
But, if your blog has a lot more people reading than writing comments, this is probably not so bad.
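To make the bookkeeping concrete, here is a minimal in-memory sketch in plain Java (all names are invented for the example; a HashMap stands in for the denormalized "number of comments" column on the Posts table). It shows the extra write that every add/remove of a comment must now perform:

```java
import java.util.HashMap;
import java.util.Map;

public class CommentCountDemo {
    // counts stands in for the denormalized column: postId -> comment count.

    static void addComment(Map<Integer, Integer> counts, int postId) {
        // Saving the comment row itself is elided; the denormalized
        // counter must be updated in the same transaction.
        counts.merge(postId, 1, Integer::sum);
    }

    static void removeComment(Map<Integer, Integer> counts, int postId) {
        counts.merge(postId, -1, Integer::sum);
    }

    public static void main(String[] args) {
        Map<Integer, Integer> counts = new HashMap<>();
        addComment(counts, 1);
        addComment(counts, 1);
        addComment(counts, 2);
        removeComment(counts, 1);
        // Listing posts with their counts now needs no pass over Comments.
        System.out.println(counts.get(1)); // 1
        System.out.println(counts.get(2)); // 1
    }
}
```

The read side gets cheaper, but every comment write now touches two places, which is exactly the trade-off described above.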
Denormalization is a time-space trade-off. Normalized data takes less space, but may require a join to construct the desired result set, hence more time. If it's denormalized, data is replicated in several places. It then takes more space, but the desired view of the data is readily available.
There are other time-space optimizations, such as:
denormalized views
precomputed columns
As with any such approach, this improves reads (because the data is readily available) but makes updates more costly (because you need to update the replicated or precomputed data).
The word "denormalizing" leads to confusion of the design issues. Trying to get a high performance database by denormalizing is like trying to get to your destination by driving away from New York. It doesn't tell you which way to go.
What you need is a good design discipline, one that produces a simple and sound design, even if that design sometimes conflicts with the rules of normalization.
One such design discipline is star schema. In a star schema, a single fact table serves as the hub of a star of tables. The other tables are called dimension tables, and they are at the rim of the schema. The dimensions are connected to the fact table by relationships that look like the spokes of a wheel. Star schema is basically a way of projecting multidimensional design onto an SQL implementation.
Closely related to star schema is snowflake schema, which is a little more complicated.
If you have a good star schema, you will be able to get a huge variety of combinations of your data with no more than a three way join, involving two dimensions and one fact table. Not only that, but many OLAP tools will be able to decipher your star design automatically, and give you point-and-click, drill down, and graphical analysis access to your data with no further programming.
Star schema design occasionally violates second and third normal forms, but it results in more speed and flexibility for reports and extracts. It's most often used in data warehouses, data marts, and reporting databases. You'll generally have much better results from star schema or some other retrieval oriented design, than from just haphazard "denormalization".
The critical issues in denormalizing are:
Deciding what data to duplicate and why
Planning how to keep the data in sync
Refactoring the queries to use the denormalized fields.
One of the easiest types of denormalization is to copy an identity field into other tables to avoid a join. As identities should never change, the issue of keeping the data in sync rarely comes up. For instance, we copy our client id into several tables because we often need to query them by client and do not necessarily need, in those queries, any of the data from the tables that would sit between the client table and the table we are querying if the data were fully normalized. You still have to do one join to get the client name, but that is better than joining through 6 parent tables to get the client name when that is the only piece of data you need from outside the table you are querying.
However, there would be no benefit to this unless we were often running queries that did not otherwise need any data from the intervening tables.
Another common denormalization is to add a name field to other tables. As names are inherently changeable, you need to ensure that the names stay in sync with triggers. But if this saves you from joining 5 tables instead of 2, it can be worth the cost of the slightly longer insert or update.
If you have certain requirements, like reporting etc., it can help to denormalize your database in various ways:
introduce some data duplication to save yourself some JOINs (e.g. copy certain information into a table and accept the duplicated data, so that everything you need is in that table and doesn't have to be found by joining another table)
you can pre-compute certain values and store them in a table column, instead of computing them on the fly every time you query the database. Of course, those computed values might get "stale" over time and you might need to re-compute them at some point, but just reading out a stored value is typically cheaper than computing something (e.g. counting child rows)
There are certainly more ways to denormalize a database schema to improve performance, but you just need to be aware that you do get yourself into a certain degree of trouble doing so. You need to carefully weigh the pros and cons - the performance benefits vs. the problems you get yourself into - when making those decisions.
Consider a database with a properly normalized parent-child relationship.
Let's say the cardinality is an average of 2:1 (two children per parent).
You have two tables: Parent, with p rows, and Child, with 2p rows.
The join operation means that for p parent rows, 2p child rows must be read. The total number of rows read is p + 2p.
Consider denormalizing this into a single table holding only the (widened) child rows, 2p of them. The number of rows read is then 2p.
Fewer rows == less physical I/O == faster.
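A quick back-of-the-envelope check of that arithmetic (a sketch; p = 1,000,000 is just an example size):

```java
public class JoinRowsRead {
    // Rows read when joining Parent (p rows) to Child (2p rows).
    static long normalizedReads(long p) {
        return p + 2 * p;
    }

    // Rows read from a single denormalized table of 2p widened child rows.
    static long denormalizedReads(long p) {
        return 2 * p;
    }

    public static void main(String[] args) {
        long p = 1_000_000L;
        System.out.println(normalizedReads(p));   // 3000000
        System.out.println(denormalizedReads(p)); // 2000000
    }
}
```

A third of the row reads disappear, at the cost of storing the parent columns redundantly in every child row.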
As per the last section of this article,
https://technet.microsoft.com/en-us/library/aa224786%28v=sql.80%29.aspx
one could use Virtual Denormalization, where you create views containing some denormalized data so that simpler SQL queries run faster, while the underlying tables remain normalized for fast add/update operations (as long as you can get away with refreshing the views at regular intervals rather than in real time). I'm just taking a class on relational databases myself but, from what I've been reading, this approach seems logical to me.
Benefits of denormalization over normalization
Basically, denormalization is used for non-relational DBMSs rather than for RDBMSs. As we know, an RDBMS works with normalization, which means data is not repeated again and again (though some data is still repeated when you use foreign keys).
When you use a non-relational DBMS, you may need to remove normalization, and for this some repetition of data is needed. Even so, it can improve performance, because there are no relations among the tables and each table has an independent existence.
For my website, I'm creating a book database. I have a catalog with a root node; each node has subnodes, each subnode has documents, each document has versions, and each version is made of several paragraphs.
In order to create this database as fast as possible, I first build the entire tree model in memory, and then I call session.save(rootNode).
This single save populates my entire database (when I run mysqldump on the database at the end, it weighs 1 GB).
The save costs a lot (more than an hour), and since the database grows with new books and new versions of existing books, it costs more and more. I would like to optimize this save.
I've tried increasing the batch_size, but it changes nothing, since it is a single save. When I take a mysqldump and insert the resulting script back into MySQL, the operation costs 2 minutes or less.
And when I run htop on the Ubuntu machine, I can see that MySQL is only using 2 or 3% CPU, which means that it's Hibernate that is slow.
If someone could suggest possible techniques or leads I could try, that would be great. I already know some of the reasons why it takes so long; if someone wants to discuss them with me, thanks for the help.
Here are some of my problems (I think): for example, I have self-assigned ids for most of my entities. Because of that, Hibernate checks each time whether the row already exists before it saves it. I don't need this, because the batch I'm executing runs only once, when I create the database from scratch. The best thing would be to tell Hibernate to ignore the primary-key checks (like mysqldump does) and re-enable key checking once the database has been created. It's just a one-shot batch to initialize my database.
A second problem is again about the foreign keys: Hibernate inserts rows with null values, then issues an update to make the foreign keys work.
About using another technology: I would like to make this batch work with Hibernate, because afterwards my whole website works very well with Hibernate, and if Hibernate creates the database I'm sure the naming rules and all the foreign keys will be created correctly.
Finally, it's a read-only database. (I have a user database using InnoDB, where I do updates and inserts while my website is running, but the document database is read-only and uses MyISAM.)
Here is an example of what I'm doing:
TreeNode rootNode = new TreeNode();
recursiveLoadSubNodes(rootNode); // This method creates my big tree, in memory only.
hibernateSession.beginTransaction();
hibernateSession.save(rootNode); // takes more than an hour; saves 1 GB of data: hundreds of sub tree nodes, thousands of documents, tens of thousands of paragraphs.
hibernateSession.getTransaction().commit();
It's a little hard to guess what the problem could be here, but I can think of 3 things:
Increasing batch_size alone might not help because, depending on your model, inserts might be interleaved (i.e. A B A B ...). You can allow Hibernate to reorder inserts and updates so that they can be batched (i.e. A A ... B B ...). Depending on your model this might not work, because the inserts might not be batchable. The relevant properties are hibernate.order_inserts and hibernate.order_updates, and a blog post that describes the situation can be found here: https://vladmihalcea.com/how-to-batch-insert-and-update-statements-with-hibernate/
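For reference, the two settings mentioned above would be enabled roughly like this. The property names hibernate.order_inserts, hibernate.order_updates and hibernate.jdbc.batch_size are real Hibernate settings; the batch size of 50 is just an example value:

```java
import java.util.Properties;

public class BatchSettings {
    // Builds the Hibernate configuration properties that enable
    // statement reordering so that inserts/updates can be batched.
    static Properties batchingProperties() {
        Properties props = new Properties();
        props.setProperty("hibernate.jdbc.batch_size", "50"); // example size
        props.setProperty("hibernate.order_inserts", "true");
        props.setProperty("hibernate.order_updates", "true");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(batchingProperties().getProperty("hibernate.order_inserts")); // true
    }
}
```

These properties would normally live in hibernate.cfg.xml or persistence.xml rather than being set in code; the Java form is shown only to keep the sketch self-contained.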
If the entities don't already exist (which seems to be the case), then the problem might be the first-level cache. This cache will cause Hibernate to get slower and slower, because each time it wants to flush changes it checks all entries in the cache by iterating over them and calling equals() (or something similar). As you can see, that takes longer with each new entity that is created. To fix that, you could either try to disable the first-level cache (I'd have to look up whether that's possible for write operations and how it is done - or you do that :) ) or try to keep the cache small, e.g. by inserting the books yourself and evicting each book from the first-level cache after the insert (you could also go deeper and do that at the document or paragraph level).
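The "keep the cache small" idea is usually implemented by flushing and clearing the session every N entities. Since Hibernate itself isn't on the classpath here, the sketch below uses a tiny hypothetical FakeSession stub with just enough API to show the pattern; with a real org.hibernate.Session you would call the same flush() and clear() methods:

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkedSaveDemo {
    // Hypothetical stand-in for the Hibernate Session, exposing only the
    // calls the pattern needs.
    static class FakeSession {
        final List<Object> firstLevelCache = new ArrayList<>();
        int flushCount = 0;

        void save(Object entity) { firstLevelCache.add(entity); }
        void flush() { flushCount++; }            // the real flush() writes pending SQL
        void clear() { firstLevelCache.clear(); } // the real clear() evicts all cached entities
    }

    // Save entityCount entities, flushing and clearing every batchSize,
    // so the first-level cache never grows past batchSize entries.
    static int saveInChunks(FakeSession session, int entityCount, int batchSize) {
        for (int i = 0; i < entityCount; i++) {
            session.save(new Object());
            if ((i + 1) % batchSize == 0) {
                session.flush();
                session.clear();
            }
        }
        session.flush(); // write any trailing partial batch
        return session.flushCount;
    }

    public static void main(String[] args) {
        System.out.println(saveInChunks(new FakeSession(), 250, 50)); // 6
    }
}
```

The point is that flush cost stays bounded by the chunk size instead of growing with every entity saved so far.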
It might not actually be Hibernate (or at least not alone) but your DB as well. Note that restoring dumps often removes/disables constraint checks and indices along with other optimizations so comparing that with Hibernate isn't that useful. What you'd need to do is create a bunch of insert statements and then just execute those - ideally via a JDBC batch - on an empty database but with all constraints and indices enabled. That would provide a more accurate benchmark.
Assuming that comparison shows that the plain SQL insert isn't that much faster then you could decide to either keep what you have so far or refactor your batch insert to temporarily disable (or remove and re-create) constraints and indices.
Alternatively you could try not to use Hibernate at all or change your model - if that's possible given your requirements which I don't know. That means you could try to generate and execute the SQL queries yourself, use a NoSQL database or NoSQL storage in a SQL database that supports it - like Postgres.
We're doing something similar, i.e. we have Hibernate entities that contain some complex data which is stored in a JSONB column. Hibernate can read and write that column via a custom usertype but it can't filter (Postgres would support that but we didn't manage to enable the necessary syntax in Hibernate).
Straight to the point: I've tried searching on Google and on SO but can't find what I'm looking for. It could be because I'm not wording my search correctly.
My question is,
I have a couple of tables which will each hold anywhere between 1,000 and 100,000 rows per year. I'm trying to figure out whether and how I should archive the data. I'm not very experienced with databases, but below are a few methods I've come up with, and I'm unsure which is better practice, taking into account performance and ease of coding. I'm using Java 1.8, Sql2o and Postgres.
Method 1
Archive the data into a separate database every year.
I don't really like this method because when we want to search for old data, our application will need to search a different database, and it will be a hassle for me to add more code for this.
Method 2
Archive the data into a separate database for data older than 2-3 years.
And use a status column on the rows to improve performance (see method 3). This is something I'm leaning towards as an 'optimal' solution, where the code is not overly complex but my DB also stays relatively clean.
Method 3
Just have a status for each row (e.g. A = active, R = archived) to possibly improve query performance. Just use "select * from table where status = 'A'" to reduce the number of rows to look through.
100,000 rows per year is not that much. [1]
There's no need to move that to a separate place. If you already have good indexes in place, you almost certainly won't notice any degraded performance over the years.
However, if you want to be absolutely sure, you could add a year column and create an index for that (or add that to your existing indexes). But really, do that only for the tables where you know you need it. For example, if your table already has a date column which is part of your index(es), you don't need a separate year column.
[1] Unless you have thousands of columns and/or columns that contain large binary blobs - which doesn't seem to be the case here.
As Vog mentions, 100,000 rows is not very many. Nor is 1,000,000 or 5,000,000 -- sizes that your tables may grow to.
In many databases, you could use a clustered index where the first key is the "active" column. However, Postgres does not really support clustered indexes.
Instead, I would suggest that you look into table partitioning. This is a method where the underlying storage is split among different "files". You can easily specify that a query reads one or more partitions by using the partitioning key in a where clause.
For your particular use-case, I would further suggest having a view over just the active data. Queries against it would read only one partition, so performance should be pretty much the same as reading a table containing only the most recent data.
That said, I'm not sure if it is better to partition by an active flag or by year. That depends on how you are accessing the data, particularly the older data.
I need to insert many rows from many files, each row of the form:
Identifier NumberValue
For each row, I check whether a row with that Identifier already exists in the database; if it does, I take its NumberValue, add the NumberValue from the arriving row, and update the database.
I have found that this per-row database lookup (a few million records in total) takes a lot of time.
Does it make sense to build an in-memory map and check this map before hitting the database?
Thanks.
I would get the value, add one hundred rows, and add one hundred to the NumberValue in a single transaction.
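The map-based, aggregate-first idea from the question can be sketched like this in plain Java (the actual database write per distinct Identifier is left out; the row data is invented for the example):

```java
import java.util.HashMap;
import java.util.Map;

public class PreAggregateDemo {
    // Sum NumberValue per Identifier in memory first, so the database sees
    // one read-modify-write per distinct Identifier instead of one per row.
    static Map<String, Long> aggregate(String[][] rows) {
        Map<String, Long> totals = new HashMap<>();
        for (String[] row : rows) {
            // row[0] = Identifier, row[1] = NumberValue
            totals.merge(row[0], Long.parseLong(row[1]), Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        String[][] incoming = {
            {"A", "10"}, {"B", "5"}, {"A", "7"}, {"C", "1"}, {"B", "3"}
        };
        Map<String, Long> totals = aggregate(incoming);
        // Five incoming rows collapse to three database round-trips.
        System.out.println(totals.get("A")); // 17
        System.out.println(totals.size());   // 3
    }
}
```

After aggregating, each entry in the map becomes a single UPDATE (or INSERT if the Identifier is new), ideally issued as a JDBC batch inside one transaction.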
You can add an index to the column you are searching on, if it's not the primary key, by using
@Table(indexes = { @Index( columnList = ".." ) })
So basically you're asking whether it will be faster to check an in-memory map of your entire database, in order to potentially save the cost of querying whether something exists and, if not, performing an insert into the database?
The answer, of course, is "maybe". Despite what you don't want to hear, it really is going to depend on the details of the database that you haven't explained to us.
Is it a local one with fast access, or perhaps something that's accessed remotely overseas across slow lines?
Are you running on a hefty machine where the amount of memory used really isn't an issue (otherwise you'll end up swapping)?
Does the database have indexes and primary keys in place so it can quickly search for and reject entries if they are duplicates?
Are these running on one server, or does each server need to update what was saved to the DB to keep this in-memory cache consistent?
In general, the in-memory map will make things faster. But as I'm sure others can point out, there are a lot of issues and exceptions you'll have to deal with. Reading in a million rows in one go is probably faster than reading in a million rows one at a time in order to check whether a particular identifier exists, but again, it really depends on the balance between quantity, resources and time available.
How's that for a non-answer...
I am trying to find out how a DB index works and when it should be used. I read some articles on this, and an important one I found is How does database indexing work?.
How it works:
Advantage 1: After reading the discussion at the link above, one thing an index helps with is reducing the number of data blocks to iterate through, as explained in example 1.
Advantage 2: But then another question came to my mind: even with an index, the database still has to search the index itself (in whatever internal structure the data store keeps), which should take time again. On further reading I found out that indexes are stored efficiently, usually in a data structure like a B-tree, through which you can drill down to any value quickly; once at the leaf node, it gives the exact location of the record for the value given in the where or join condition. Correct? So basically the index stores the value of the column on which we are creating the index, plus the location of the actual record.
When it should be used: As we know, if we create an index on a column and then insert/update/delete values of that column, the index needs to be updated as well, so CUD operations take a bit of extra time and memory. So when should an index be used? Imagine we create customers one at a time from a user screen, so that by the end of the day there are 1 million customers in total. Now if we want to search for the customers who are from New York, an index will help a lot. Agreed, it will slow down each insert a bit, but only fractionally, while the performance we get when retrieving the New York customers will be exceptionally good.
Please correct me if you agree/disagree with above finding?
Your general conclusions are pretty much ok.
Yes, for some queries, an index means less data blocks need to be read.
Yes, the default index type in Oracle is implemented internally using a B-Tree.
Yes, there is some overhead for Create/Update/Delete operations on a table with indexes - both in terms of performance and space used - but this overhead is usually negligible, and easily justified when the improvement to the performance of queries is considered.
I heartily recommend reading the Oracle Concepts Guide on indexes.
The previous responses (and your conclusions) are correct. With regard to when to use indexes, it might be easier to discuss when not to use them. Here are a couple of scenarios in which an index might not be appropriate:
A table with a high rate of inserts that you never or rarely select from. An example of such a table might be some type of logging table.
A very small table whose rows all fit into one or a couple of blocks.
Indexes speed up selects.
They do this by reducing the number of rows to check.
Example
I have a table with 1,000,000,000 rows.
id is a primary key.
gender can be either male or female
city can be one of 50 options.
street can be lots of different options.
When I'm looking for a unique value, using the index will take about 30 lookups in a fully balanced tree.
Without the index it will take 500,000,000 lookups on average.
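Those two numbers come straight from the arithmetic: a balanced binary tree over 10^9 keys is about log2(10^9) ≈ 30 levels deep, while an unindexed scan checks half the table on average. A quick check:

```java
public class IndexLookupMath {
    // Comparisons needed to find a unique key in a fully balanced binary tree.
    static long treeLookups(long rows) {
        return (long) Math.ceil(Math.log(rows) / Math.log(2));
    }

    // Rows examined on average by a full scan before hitting a unique key.
    static long scanLookups(long rows) {
        return rows / 2;
    }

    public static void main(String[] args) {
        long rows = 1_000_000_000L;
        System.out.println(treeLookups(rows)); // 30
        System.out.println(scanLookups(rows)); // 500000000
    }
}
```

(Real B-tree indexes have a much higher branching factor than 2, so the actual number of page reads is far smaller than 30, but the logarithmic-versus-linear contrast is the point.)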
However, putting an index on gender is pointless, because it will not reduce the search time enough to justify the extra time needed to use the index, look up the items and then get the data from the rows.
For city it is a borderline case: with 50 different cities an index is useful; with only 5, the index has low cardinality and will probably not get used.
Indexes slow down inserts and updates.
More stuff to consider
MySQL can only use one index per (sub) select per table.
If you want to use an index on:
SELECT * FROM table1 WHERE city = 'New York' AND Street = 'Hoboken'
You will have to declare a compound index:
ALTER TABLE table1 ADD INDEX index_name (city, street)