I am making an app that can look up creatures and am attempting to increase my knowledge in the process.
I have a table Creatures and a table Skills
A creature can have multiple skills and a skill can be used by multiple creatures.
I am coding in Java using SQL manager.
I am currently using numeric IDs (1, 2, ...) to represent skills in the creature table and referencing the skills table by those values.
One thought I had: is there a way to make an overloaded stored procedure?
I have not started coding yet as I am still planning but would appreciate any ideas sent my way.
I am not trying to avoid the middle table, just to see if there is another way to do it that is not so hard it's pointless.
You will probably need the middle table.
Storing a comma-separated list of skills in the Creatures table makes it easy to fetch the skills per creature, but what if you ever want to know the creatures who have a given skill?
Comma-separated lists are fraught with problems. You can use them to optimize one way of accessing the data, but that causes a drastic de-optimization of other ways of accessing the data.
See also my answer to "Is storing a delimited list in a database column really that bad?"
If you're using a relational database, the "right" and general way to solve it is with a table that will store the relation.
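To make that concrete, here is a minimal sketch of the junction (middle) table and the two typical queries, assuming hypothetical table and column names (Creatures(id, name), Skills(id, name), CreatureSkills) and plain JDBC with a PostgreSQL-style URL; adapt the names and driver to whatever you actually use:

    import java.sql.*;

    public class CreatureSkillExample {
        public static void main(String[] args) throws SQLException {
            // Connection details are placeholders; use your own URL/credentials.
            try (Connection con = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/creaturedb", "user", "password");
                 Statement st = con.createStatement()) {

                // The junction (middle) table: one row per creature/skill pair.
                st.execute("CREATE TABLE IF NOT EXISTS CreatureSkills ("
                         + "  creature_id INT REFERENCES Creatures(id),"
                         + "  skill_id    INT REFERENCES Skills(id),"
                         + "  PRIMARY KEY (creature_id, skill_id))");

                // All skills of one creature...
                try (PreparedStatement ps = con.prepareStatement(
                        "SELECT s.name FROM Skills s "
                      + "JOIN CreatureSkills cs ON cs.skill_id = s.id "
                      + "WHERE cs.creature_id = ?")) {
                    ps.setInt(1, 42);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) System.out.println(rs.getString(1));
                    }
                }

                // ...and, just as easily, all creatures that have a given skill.
                try (PreparedStatement ps = con.prepareStatement(
                        "SELECT c.name FROM Creatures c "
                      + "JOIN CreatureSkills cs ON cs.creature_id = c.id "
                      + "WHERE cs.skill_id = ?")) {
                    ps.setInt(1, 7);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) System.out.println(rs.getString(1));
                    }
                }
            }
        }
    }

The point of the junction table is that both directions of the relationship stay cheap to query.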
If you want to avoid the middle table, you can put a constraint on the maximum number of skills per creature - let's say max 5 skills, and then have fields called skill1, skill2, ..., skill5. I cannot recommend this option, because it will make querying much more complicated, but for some cases it's possible.
A further variation of this option would be a single int or long field where each bit represents a skill. Still not good, in my opinion.
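If you did go the bit-flag route, it would look something like the sketch below in Java (the skill constants are made up). Querying "which creatures have skill X" then needs bitwise operations in SQL, which is part of why it's hard to recommend:

    public class SkillFlags {
        // Hypothetical skill flags, one bit each, stored in a single INT column.
        static final int SKILL_FIRE = 1 << 0;
        static final int SKILL_ICE  = 1 << 1;
        static final int SKILL_FLY  = 1 << 2;

        public static void main(String[] args) {
            int skills = SKILL_FIRE | SKILL_FLY;          // creature has fire + fly
            boolean canFly = (skills & SKILL_FLY) != 0;   // test a skill
            skills |= SKILL_ICE;                          // add a skill
            skills &= ~SKILL_FIRE;                        // remove a skill
            System.out.println(canFly + " " + Integer.toBinaryString(skills));
        }
    }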
Related
I have heard a lot about denormalization, which is meant to improve the performance of certain applications, but I've never tried to do anything related to it.
So I'm just curious: which places in a normalized DB make performance worse, or in other words, what are the principles of denormalization?
How can I use this technique if I need to improve performance?
Denormalization is generally used to either:
Avoid a certain number of queries
Remove some joins
The basic idea of denormalization is that you add redundant data, or group some data, to be able to retrieve it more easily, at a smaller cost, which is better for performance.
A quick example?
Consider a "Posts" and a "Comments" table, for a blog
For each Post, you'll have several lines in the "Comment" table
This means that to display a list of posts with the associated number of comments, you'll have to:
Do one query to list the posts
Do one query per post to count how many comments it has (Yes, those can be merged into only one, to get the number for all posts at once)
Which means several queries.
Now, if you add a "number of comments" field into the Posts table:
You only need one query to list the posts
And there is no need to query the Comments table: the number of comments is already denormalized into the Posts table.
One query that returns one extra field is better than several queries.
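As a sketch, assuming hypothetical posts/comments tables with a comment_count column on posts (all names are assumptions, plain JDBC), the normalized and denormalized read queries could look like this:

    import java.sql.*;

    public class PostListing {
        // Normalized: count comments per post with a join + GROUP BY.
        static final String NORMALIZED =
            "SELECT p.id, p.title, COUNT(c.id) AS comment_count "
          + "FROM posts p LEFT JOIN comments c ON c.post_id = p.id "
          + "GROUP BY p.id, p.title";

        // Denormalized: the count is just another column on posts.
        static final String DENORMALIZED =
            "SELECT id, title, comment_count FROM posts";

        public static void main(String[] args) throws SQLException {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/blog", "user", "password");
                 Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(DENORMALIZED)) {
                while (rs.next()) {
                    System.out.printf("%s (%d comments)%n",
                            rs.getString("title"), rs.getInt("comment_count"));
                }
            }
        }
    }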
Now, there are some costs, yes:
First, this costs some space on disk and in memory, as you have some redundant information:
The number of comments is stored in the Posts table
You can also get that number by counting rows in the Comments table
Second, each time someone adds/removes a comment, you have to:
Save/delete the comment, of course
But also, update the corresponding number in the Posts table.
But, if your blog has a lot more people reading than writing comments, this is probably not so bad.
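On the write side, one way (among others, such as a trigger) to keep the counter in sync is to do the insert and the increment in the same transaction. A sketch, again with assumed table and column names:

    import java.sql.*;

    public class AddComment {
        // Inserts a comment and bumps the denormalized counter atomically.
        static void addComment(Connection con, long postId, String body) throws SQLException {
            boolean oldAutoCommit = con.getAutoCommit();
            con.setAutoCommit(false);
            try (PreparedStatement insert = con.prepareStatement(
                     "INSERT INTO comments (post_id, body) VALUES (?, ?)");
                 PreparedStatement bump = con.prepareStatement(
                     "UPDATE posts SET comment_count = comment_count + 1 WHERE id = ?")) {
                insert.setLong(1, postId);
                insert.setString(2, body);
                insert.executeUpdate();
                bump.setLong(1, postId);
                bump.executeUpdate();
                con.commit();
            } catch (SQLException e) {
                con.rollback();   // keep the comment row and the counter consistent
                throw e;
            } finally {
                con.setAutoCommit(oldAutoCommit);
            }
        }
    }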
Denormalization is a time-space trade-off. Normalized data takes less space, but may require a join to construct the desired result set, hence more time. If it's denormalized, data is replicated in several places. It then takes more space, but the desired view of the data is readily available.
There are other time-space optimizations, such as
denormalized view
precomputed columns
As with any such approach, this improves reading data (because it is readily available), but updating data becomes more costly (because you need to update the replicated or precomputed data).
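For example, in PostgreSQL a precomputed aggregation can live in a materialized view that you refresh when it gets stale. This is only a sketch with assumed names (post_stats, posts, comments), run through JDBC here just to stay in Java; the same SQL could be run directly:

    import java.sql.*;

    public class PrecomputedSummary {
        public static void main(String[] args) throws SQLException {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/blog", "user", "password");
                 Statement st = con.createStatement()) {

                // Precompute the per-post comment counts once...
                st.execute("CREATE MATERIALIZED VIEW IF NOT EXISTS post_stats AS "
                         + "SELECT p.id AS post_id, COUNT(c.id) AS comment_count "
                         + "FROM posts p LEFT JOIN comments c ON c.post_id = p.id "
                         + "GROUP BY p.id");

                // ...reads are now cheap...
                try (ResultSet rs = st.executeQuery(
                        "SELECT post_id, comment_count FROM post_stats")) {
                    while (rs.next()) {
                        System.out.println(rs.getLong(1) + " -> " + rs.getLong(2));
                    }
                }

                // ...but the data goes stale until you pay the refresh cost.
                st.execute("REFRESH MATERIALIZED VIEW post_stats");
            }
        }
    }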
The word "denormalizing" leads to confusion of the design issues. Trying to get a high performance database by denormalizing is like trying to get to your destination by driving away from New York. It doesn't tell you which way to go.
What you need is a good design discipline, one that produces a simple and sound design, even if that design sometimes conflicts with the rules of normalization.
One such design discipline is star schema. In a star schema, a single fact table serves as the hub of a star of tables. The other tables are called dimension tables, and they are at the rim of the schema. The dimensions are connected to the fact table by relationships that look like the spokes of a wheel. Star schema is basically a way of projecting multidimensional design onto an SQL implementation.
Closely related to star schema is snowflake schema, which is a little more complicated.
If you have a good star schema, you will be able to get a huge variety of combinations of your data with no more than a three way join, involving two dimensions and one fact table. Not only that, but many OLAP tools will be able to decipher your star design automatically, and give you point-and-click, drill down, and graphical analysis access to your data with no further programming.
Star schema design occasionally violates second and third normal forms, but it results in more speed and flexibility for reports and extracts. It's most often used in data warehouses, data marts, and reporting databases. You'll generally have much better results from star schema or some other retrieval oriented design, than from just haphazard "denormalization".
The critical issues in denormalizing are:
Deciding what data to duplicate and why
Planning how to keep the data in synch
Refactoring the queries to use the denormalized fields.
One of the easiest types of denormalization is to copy an identity field into related tables to avoid a join. As identities should never change, the issue of keeping the data in sync rarely comes up. For instance, we copy our client id into several tables because we often need to query them by client and do not necessarily need, in those queries, any of the data from the tables that would sit between the client table and the table we are querying if the data were fully normalized. You still have to do one join to get the client name, but that is better than joining through 6 parent tables to get the client name when that is the only piece of data you need from outside the table you are querying.
However, there would be no benefit to this unless we often ran queries where data from the intervening tables was not needed.
Another common denormalization is to add a name field to other tables. As names are inherently changeable, you need to ensure that they stay in sync with triggers. But if this lets you join to 2 tables instead of 5, it can be worth the cost of the slightly longer insert or update.
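A sketch of what such a sync trigger could look like in PostgreSQL 11+ (all names are assumptions: a client_name column duplicated onto an orders table). It is created via JDBC here only to stay in Java; running the SQL directly works just as well:

    import java.sql.*;

    public class NameSyncTrigger {
        public static void main(String[] args) throws SQLException {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/appdb", "user", "password");
                 Statement st = con.createStatement()) {

                // Whenever a client is renamed, push the new name into the
                // denormalized copy on the orders table.
                st.execute("CREATE OR REPLACE FUNCTION sync_client_name() RETURNS trigger AS $$ "
                         + "BEGIN "
                         + "  UPDATE orders SET client_name = NEW.name WHERE client_id = NEW.id; "
                         + "  RETURN NEW; "
                         + "END; $$ LANGUAGE plpgsql");

                st.execute("DROP TRIGGER IF EXISTS client_name_sync ON clients");
                st.execute("CREATE TRIGGER client_name_sync "
                         + "AFTER UPDATE OF name ON clients "
                         + "FOR EACH ROW EXECUTE FUNCTION sync_client_name()");
            }
        }
    }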
If you have certain requirements, like reporting, it can help to denormalize your database in various ways:
introduce some data duplication to save yourself some JOINs (e.g. fill certain information into a table and accept the duplicated data, so that all the data is in that table and doesn't need to be found by joining another table)
pre-compute certain values and store them in a table column, instead of computing them on the fly every time you query the database. Of course, those computed values might get "stale" over time and you might need to re-compute them at some point, but just reading out a fixed value is typically cheaper than computing something (e.g. counting child rows)
There are certainly more ways to denormalize a database schema to improve performance, but you just need to be aware that you do get yourself into a certain degree of trouble doing so. You need to carefully weigh the pros and cons - the performance benefits vs. the problems you get yourself into - when making those decisions.
Consider a database with a properly normalized parent-child relationship.
Let's say the cardinality is an average of 2:1.
You have two tables: Parent with p rows, and Child with 2p rows.
The join operation means that for p parent rows, 2p child rows must be read. The total number of rows read is p + 2p = 3p.
Consider denormalizing this into a single table containing only the child rows, 2p of them. The number of rows read is then 2p.
Fewer rows == less physical I/O == faster.
As per the last section of this article,
https://technet.microsoft.com/en-us/library/aa224786%28v=sql.80%29.aspx
one could use Virtual Denormalization, where you create views with some denormalized data for running simpler SQL queries faster, while the underlying tables remain normalized for faster add/update operations (as long as you can get away with updating the views at regular intervals rather than in real time). I'm just taking a class on relational databases myself, but from what I've been reading, this approach seems logical to me.
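For example, a plain view can bake the joins in once, so read-heavy code issues simple single-"table" queries while the base tables stay normalized (names such as order_details, orders and clients are assumptions; if you want the view refreshed at intervals instead of computed live, a materialized view as in the earlier sketch is the closer fit):

    import java.sql.*;

    public class VirtualDenormalization {
        public static void main(String[] args) throws SQLException {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/appdb", "user", "password");
                 Statement st = con.createStatement()) {

                // The view does the joins once; the base tables stay normalized.
                st.execute("CREATE OR REPLACE VIEW order_details AS "
                         + "SELECT o.id, o.order_date, c.name AS client_name, c.region "
                         + "FROM orders o JOIN clients c ON c.id = o.client_id");

                // Readers now query the view as if it were one wide table.
                try (ResultSet rs = st.executeQuery(
                        "SELECT id, order_date, client_name FROM order_details "
                      + "WHERE region = 'EU'")) {
                    while (rs.next()) {
                        System.out.println(rs.getLong(1) + " " + rs.getString(3));
                    }
                }
            }
        }
    }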
Benefits of de-normalization over normalization
Basically, de-normalization is used for DBMS-style designs rather than a strict RDBMS. An RDBMS works with normalization, which means not repeating data again and again, although some data is still repeated when you use foreign keys.
When you use a DBMS that way, you need to remove normalization, which requires some repetition of data. But it can still improve performance, because there is no relation among the tables and each table has an indivisible existence.
I am wondering how I would store my custom network level in a MySQL table. I could make four columns: 'level', 'exp', 'expreq' and 'total'. But that takes up four columns, and as I am storing name, rank and other data in the same table, it will end up with too many columns. Are there better ways? Should I make another table?
In a relational data model, and for the sake of extensibility, you would do this in a separate table: a master table that points to a detail table, in which you can have as many attributes as you need.
BUT
This has an obvious impact on storage when it becomes large. In addition, this approach is often replaced by a less-normalized version of the tables, introducing concepts like "Custom Fields".
OR
If it were me, and this table will be accessed from a programming language, I would store the values in JSON format in a very simple table and let the program handle the processing overhead.
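A sketch of that idea with MySQL's JSON column type (5.7+), using made-up table and column names and plain JDBC:

    import java.sql.*;

    public class JsonLevelStorage {
        public static void main(String[] args) throws SQLException {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost/gamedb", "user", "password");
                 Statement st = con.createStatement()) {

                st.execute("CREATE TABLE IF NOT EXISTS players ("
                         + "  id INT PRIMARY KEY AUTO_INCREMENT,"
                         + "  name VARCHAR(50),"
                         + "  level_data JSON)");   // level/exp/expreq/total live here

                try (PreparedStatement ps = con.prepareStatement(
                        "INSERT INTO players (name, level_data) VALUES (?, ?)")) {
                    ps.setString(1, "alice");
                    ps.setString(2, "{\"level\":3,\"exp\":120,\"expreq\":200,\"total\":560}");
                    ps.executeUpdate();
                }

                // The application reads the JSON back and does the processing itself.
                try (ResultSet rs = st.executeQuery(
                        "SELECT name, level_data FROM players")) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1) + " -> " + rs.getString(2));
                    }
                }
            }
        }
    }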
Straight to the point: I've tried searching on Google and on SO but can't find what I'm looking for. It could be because I'm not wording my searches correctly.
My question is,
I have a couple of tables which will be holding anywhere between 1,000 and 100,000 rows per year. I'm trying to figure out whether and how I should handle archiving the data. I'm not well experienced with databases, but below are a few methods I've come up with, and I'm unsure which is better practice, taking into account performance and ease of coding. I'm using Java 1.8, Sql2o and Postgres.
Method 1
Archive the data into a separate database every year.
I don't really like this method because when we want to search for old data, our application will need to search a different database, and it will be a hassle for me to add more code for this.
Method 2
Archive the data into a separate database for data older than 2-3 years.
And use a status on the rows to improve performance (see method 3). This is something I'm leaning towards as an 'optimal' solution, where the code is not that complex to write but it also keeps my DB relatively clean.
Method 3
Just have a status for each row (e.g. A = active, R = archived) to possibly improve the performance of the query. Just having a "select * from table where status = 'A'" reduces the number of rows to look through.
100,000 rows per year is not that much. [1]
There's no need to move that to a separate place. If you already have good indexes in place, you almost certainly won't notice any degraded performance over the years.
However, if you want to be absolutely sure, you could add a year column and create an index for that (or add that to your existing indexes). But really, do that only for the tables where you know you need it. For example, if your table already has a date column which is part of your index(es), you don't need a separate year column.
[1] Unless you have thousands of columns and/or columns that contain large binary blobs, which doesn't seem to be the case here.
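For example, assuming a created_at date column, a status column as in method 3, and PostgreSQL, an ordinary index on the date (or a partial index over only the active rows) is usually all you need; all names here are made up:

    import java.sql.*;

    public class ArchiveIndexes {
        public static void main(String[] args) throws SQLException {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/appdb", "user", "password");
                 Statement st = con.createStatement()) {

                // Index on the date column: queries restricted to recent data stay fast.
                st.execute("CREATE INDEX IF NOT EXISTS idx_orders_created_at "
                         + "ON orders (created_at)");

                // Or a partial index covering only the 'active' rows (method 3).
                st.execute("CREATE INDEX IF NOT EXISTS idx_orders_active "
                         + "ON orders (created_at) WHERE status = 'A'");

                // A typical query then only touches the relevant part of the index.
                try (PreparedStatement ps = con.prepareStatement(
                        "SELECT * FROM orders WHERE status = 'A' AND created_at >= ?")) {
                    ps.setDate(1, Date.valueOf("2023-01-01"));
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) { /* process row */ }
                    }
                }
            }
        }
    }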
As Vog mentions, 100,000 rows is not very many. Nor is 1,000,000 or 5,000,000 -- sizes that your tables may grow to.
In many databases, you could use a clustered index where the first key is the "active" column. However, Postgres does not really support clustered indexes.
Instead, I would suggest that you look into table partitioning. This is a method where the underlying storage is split among different "files". You can easily specify that a query reads one or more partitions by using the partitioning key in a where clause.
For your particular use-case, I would further suggest having views on the data only for the active data. This would only read one partition, so the performance should be pretty much the same as reading a table with only the most recent data.
That said, I'm not sure if it is better to partition by an active flag or by year. That depends on how you are accessing the data, particularly the older data.
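A sketch of declarative partitioning in PostgreSQL 10+, partitioned by year here (partitioning by the active flag would be list partitioning instead); table and column names are assumptions, and the DDL is sent through JDBC only to stay in Java:

    import java.sql.*;

    public class PartitionedTable {
        public static void main(String[] args) throws SQLException {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/appdb", "user", "password");
                 Statement st = con.createStatement()) {

                // Parent table, partitioned by the date column.
                st.execute("CREATE TABLE IF NOT EXISTS orders ("
                         + "  id BIGSERIAL,"
                         + "  status CHAR(1) NOT NULL,"
                         + "  created_at DATE NOT NULL"
                         + ") PARTITION BY RANGE (created_at)");

                // One partition per year; old years can later be detached or archived cheaply.
                st.execute("CREATE TABLE IF NOT EXISTS orders_2023 PARTITION OF orders "
                         + "FOR VALUES FROM ('2023-01-01') TO ('2024-01-01')");
                st.execute("CREATE TABLE IF NOT EXISTS orders_2024 PARTITION OF orders "
                         + "FOR VALUES FROM ('2024-01-01') TO ('2025-01-01')");

                // A view over only the recent/active data, as suggested above;
                // the date predicate lets the planner prune to one partition.
                st.execute("CREATE OR REPLACE VIEW active_orders AS "
                         + "SELECT * FROM orders WHERE status = 'A' "
                         + "AND created_at >= DATE '2024-01-01'");
            }
        }
    }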
I'm fairly new to the app-engine datastore but get that it is designed more like a Hashtable than a database table. This leads me to think it's better to have fewer rows (entities) and more columns (object properties) "in general".
That is, you can create a Car object with properties color and count or you can create it with properties redCount, blueCount, greenCount, assuming you know all the colors (dimensions). If you are storing instances of those objects you would have either three or one:
For each color and count, save new entity:
"red", 3
"blue", 8
"green", 4
Or save one entity with properties for each possible color: 3, 8, 4
Obviously there are some design challenges with the latter approach, but I'm wondering what the advice is for getting out of relational thinking. It seems the datastore is quite happy with hundreds of "columns" / properties.
Good job trying to get out of relational thinking. It's good to move away from the row/table thinking.
A closer approximation, at least on the programming side, would be to think of entities as data structures or class instances stored remotely. These entities have properties. Separate from the entities are indexes, which essentially store lists of entities that match certain criteria for properties.
When you write an entity, the datastore updates that instance in memory/storage, and then updates all the indexes.
When you do a query, you essentially walk through one of the index lists.
That should give you a basic framework to think about the datastore.
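In the Java low-level API that mental model looks roughly like the sketch below (the kind and property names are made up, and it assumes a running App Engine context):

    import com.google.appengine.api.datastore.*;
    import java.util.List;

    public class CarCounts {
        public static void main(String[] args) {
            DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

            // Writing an entity: the instance is stored, then every index
            // that covers its indexed properties is updated.
            Entity car = new Entity("CarCount");
            car.setProperty("color", "red");
            car.setProperty("count", 3L);
            ds.put(car);

            // A query is essentially a walk over one of those index lists.
            // (Non-ancestor queries are eventually consistent, so a just-written
            // entity may not show up immediately.)
            Query q = new Query("CarCount")
                    .setFilter(new Query.FilterPredicate(
                            "color", Query.FilterOperator.EQUAL, "red"));
            List<Entity> results =
                    ds.prepare(q).asList(FetchOptions.Builder.withDefaults());
            for (Entity e : results) {
                System.out.println(e.getProperty("color") + " = " + e.getProperty("count"));
            }
        }
    }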
When you design for the datastore, you generally have to design for cost, and to a lesser degree, performance. On the write side, you want to minimize the number of indexes. On the read side, you want to minimize the number of entities you're reading, so the idea of having separate entities for red, blue, green could be a bad idea, tripling your read costs if you constantly need to read back the number of red/blue/green cars. There could be some really obscure corner case where this makes sense.
Your design considerations generally should go along the lines of:
What types of queries do I need to do?
How do I structure my data to make these queries easy to do (since the GAE query capabilities are limited)? Would a query be easier if I duplicate data somehow, and would I be able to maintain this duplicated data on my own?
How can I minimize the number of indexes that need to be updated when I update an entity?
Are there any special cases where I must have full consistency and therefore need to adjust the structure so that consistent queries can be made?
Are there any write performance cases I need to be careful about?
Without knowing exactly what kind of query you're going to make, this answer will likely not be right, but it should illustrate how you might want to think of this.
I'll assume you have an application where people register their cars, and you have a dashboard that polls the datastore and displays the number of cars of each color. The traditional mechanism of having a Car class with color and count attributes still makes sense because it minimizes the number of indexed properties, thus reducing your write costs.
It's a bit of an odd example, because I can't tell if you want to just have a single entity that keeps track of your counts (in which case you don't even need to do a query, you can just fetch that count), or if you have a number of entities of counts that you may fetch and sum up.
If user updates modify the same entity, though, you might run into performance problems; you should read through this: https://developers.google.com/appengine/articles/sharding_counters
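If you do keep a single counter entity per color and fetch it directly by key instead of querying, a transactional increment might look like the sketch below (kind and key names are assumptions); under heavy write load you would split it into shards as the linked article describes:

    import com.google.appengine.api.datastore.*;

    public class ColorCounter {
        static final DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

        // One entity per color, addressed directly by key (no query needed).
        static void increment(String color) {
            Key key = KeyFactory.createKey("ColorCount", color);
            Transaction txn = ds.beginTransaction();
            try {
                Entity counter;
                try {
                    counter = ds.get(txn, key);
                } catch (EntityNotFoundException e) {
                    counter = new Entity("ColorCount", color);
                    counter.setProperty("count", 0L);
                }
                long count = (Long) counter.getProperty("count");
                counter.setProperty("count", count + 1);
                ds.put(txn, counter);
                txn.commit();
            } finally {
                if (txn.isActive()) {
                    txn.rollback();  // a contended commit throws; roll back cleanly
                }
            }
        }
    }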
I would recommend not storing things in your own custom format in a single cell, unless it is encoded in JSON or something similar.
{"red": 3, "blue": 4}
JSON is OK because it can easily be decoded into a data structure within Java, like a map or a list.
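For example, with a JSON library such as Gson (assumed to be on the classpath) the cell value decodes straight into a map:

    import com.google.gson.Gson;
    import com.google.gson.reflect.TypeToken;
    import java.lang.reflect.Type;
    import java.util.Map;

    public class ColorCountsJson {
        public static void main(String[] args) {
            String cell = "{\"red\": 3, \"blue\": 4}";   // the value stored in the column

            Type type = new TypeToken<Map<String, Integer>>() {}.getType();
            Map<String, Integer> counts = new Gson().fromJson(cell, type);

            System.out.println(counts.get("red"));   // 3
        }
    }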
There is nothing wrong with lots of columns in an app. You will get more gains by having a column each for red, blue and green. There would have to be a very large number of columns to see a big slowdown.
I think it's safe to say that there is no significant performance penalty for having a lot of properties (columns) for each entity (row) in a database model. Nor is there a penalty for lots of rows (entities), or even lots of tables (db classes). If I were doing your example, I would definitely set up separate properties for color and count. We always explicitly call out indexed=False/True to ensure we avoid the dreaded problem of wondering why your indexes are so large when you only have a few properties indexed (forgetting that the default is True). Although GAE gives you nice properties such as lists that can be indexed, these specialized properties are not without their overhead costs. Understand them well whenever you use them.
One thing that I think is important to remember with GAE when plotting your design is that standard queries are slow, and slow equates to increased latency, and increased latency results in more instances and more expense (and other frustrations). Before defaulting to a standard query, always ask (if this is a mission-critical part of your code) whether you can accomplish the same thing by setting up a more denormalized data structure. For example, linking a set of entities together using a common key and then doing a series of get_by_id() calls can often be advantageous (be sure to manage ndb's auto memcache when doing this - not everything needs to be cached). Ancestor queries are also much faster than standard queries (but they impose a limit of roughly one update per second on the entity group).
Concluding: within reason, the number of properties (columns) in an entity (row) and also the total number of classes (tables) will not impose any real issues. However, if you are coming from a standard relational DB background, your inclination will be to use SQL-like queries to move your logic along. Remember that in GAE standard GQL queries are slow and costly, so always think about things like using denormalization to avoid them. GAE is a big, flat, highly performant NoSQL-like resource. Use it as such. Take the extra time to avoid reliance on GQL queries; it will be worth it.
I've been using the low level datastore API for App Engine in Java for a while now and I'm trying to figure out the best way to handle one to many relationships. Imagine a one to many relationship like "Any one student can have zero or more computers, but every computer is owned by exactly one student".
The two options are to:
have the student entity store a list of Keys of the computers associated with the student
have the computer entity store a single Key of the student who owns the computer
I have a feeling option two is better but I am curious what other people think.
The advantage of option one is that you can get all the 'manys' back without using a Query. One can ask the datastore for all entities using get() and passing in the stored list of keys. The problem with this approach is that you cannot have the datastore do any sorting of the values that get returned from get(). You must do the sorting yourself. Plus, you have to manage a list rather than a single Key.
Option two seems nice because there is no list to maintain. Also, you can sort by properties of the computer as long as there is an index for that property. Imagine trying to get all the computers for a student where the results are sorted by purchase date. With approach two it is a simple query, and no sorting is done in our code (the datastore's index takes care of it).
Sorting is not really hard, but a little more time consuming (~O(n log n) for a sort) than having a sorted index (~O(n) for walking the index). The tradeoff is an index (space in the datastore) for processing time. As I said, my instinct tells me option two is the better general solution because it gives the developer a little more flexibility in getting results back in order, at the cost of additional indexes (which, with the Google pricing model, are pretty cheap). Does anyone agree, disagree, or have comments?
Both approaches are valid in different situations, though option two - storing a single reference on the 'many' side - is the more common approach. Which you use depends on how you need to access your data.
Have you considered doing both? Then you could quickly get a list of computers a student owns by key OR use a query which returns results in some sorted order. I don't think maintaining a list of keys on the student model is as intimidating as you think.
Don't underestimate the benefit of fetching entities directly by keys. According to this article, this can be 4-5x faster than queries.
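A sketch of both options in the Java low-level API (kind and property names are assumptions, and it expects a running App Engine context): option two stores the student's Key on each Computer and lets an index do the sorting, while option one's batch get by keys is also shown for comparison:

    import com.google.appengine.api.datastore.*;
    import java.util.Arrays;
    import java.util.Date;
    import java.util.List;
    import java.util.Map;

    public class StudentComputers {
        static final DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

        // Option two: each Computer stores the Key of its owning Student.
        // Needs a composite index on (owner, purchaseDate); non-ancestor
        // queries are eventually consistent.
        static List<Entity> computersByPurchaseDate(Key studentKey) {
            Query q = new Query("Computer")
                    .setFilter(new Query.FilterPredicate(
                            "owner", Query.FilterOperator.EQUAL, studentKey))
                    .addSort("purchaseDate", Query.SortDirection.DESCENDING);
            return ds.prepare(q).asList(FetchOptions.Builder.withDefaults());
        }

        // Option one: the Student keeps a list of Computer keys; fetch them
        // directly with a batch get (no query, but no datastore-side sorting).
        static Map<Key, Entity> computersByKeys(List<Key> computerKeys) {
            return ds.get(computerKeys);
        }

        public static void main(String[] args) {
            Entity student = new Entity("Student");
            student.setProperty("name", "Alice");
            Key studentKey = ds.put(student);

            Entity computer = new Entity("Computer");
            computer.setProperty("owner", studentKey);
            computer.setProperty("purchaseDate", new Date());
            Key computerKey = ds.put(computer);

            System.out.println(computersByPurchaseDate(studentKey).size());
            System.out.println(computersByKeys(Arrays.asList(computerKey)).size());
        }
    }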