Suppose I have the following class:
class Example {
    List<ObjectA> objectsOfA;
    List<ObjectB> objectsOfB;
    List<ObjectC> objectsOfC;
    // ...
}
I would usually have these tables:
Example (Id, other-attributes)
ObjectA (some attributes, ExampleId)
ObjectB (some attributes, ExampleId)
ObjectC (some attributes, ExampleId)
If I want to restore an Example-object, I imagine I have two options:
join every table together, which produces many rows, and reorganize them in hashmaps myself to reassemble the objects
load every Example and, for each Example, issue separate requests for its lists of ObjectA, ObjectB, and ObjectC
If the number of Example entries is low, option 2 might be best. But for every single entry in Example, I need to do x more requests, where x is the number of list tables for my class.
Otherwise, having everything in a single join requires me to reorganize all the data myself - creating hashmaps, iterating through the data, and so on - which is usually a lot of work in code.
I also get the possibility of lazy loading with option 2.
Do I have other choices? Did I miss something? (Of course I know about ORMs, but I decided not to use them on Android.)
If I understood correctly, your question is, which of these ways is better to load data from a main table and related tables:
Load a JOIN of the main table and related tables
Load the data of the main table and related tables separately and join them in Java
Something else
Unless you are extremely constrained for bandwidth, I'd say option 1 is simple and I would go with that. It's easier to let the DB do the joining and, in Java, just map the records to objects.
If saving bandwidth between the application and the database is important, then option 2 is better, because every piece of data will be fetched only once, without duplication; the result of a JOIN in option 1 is essentially denormalized data.
In any case, I recommend following Occam's razor: the simplest solution is often the best.
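To make option 1 concrete, here is a minimal JDBC sketch for one of the child tables (table and column names are assumptions, not from the question): the database does the JOIN, and the Java side just groups the rows by Example id instead of issuing one request per Example.

    import java.sql.*;
    import java.util.*;

    class ExampleLoader {
        // Groups the names of ObjectA rows under their parent Example id.
        Map<Long, List<String>> loadObjectAByExample(Connection con) throws SQLException {
            String sql = "SELECT e.Id AS exampleId, a.name AS aName "
                       + "FROM Example e LEFT JOIN ObjectA a ON a.ExampleId = e.Id";
            Map<Long, List<String>> objectsOfA = new HashMap<>();
            try (Statement st = con.createStatement(); ResultSet rs = st.executeQuery(sql)) {
                while (rs.next()) {
                    long exampleId = rs.getLong("exampleId");
                    String aName = rs.getString("aName");
                    // LEFT JOIN yields a null child column for Examples with no ObjectA rows.
                    List<String> list = objectsOfA.computeIfAbsent(exampleId, id -> new ArrayList<>());
                    if (aName != null) {
                        list.add(aName);
                    }
                }
            }
            return objectsOfA;
        }
    }

The same grouping step is repeated for ObjectB and ObjectC, or all three are joined in one statement if the duplication is acceptable.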
Related
I've heard a lot about denormalization, which is done to improve the performance of certain applications, but I've never tried it myself.
So I'm just curious: which parts of a normalized DB hurt performance, or, in other words, what are the principles of denormalization?
How can I use this technique if I need to improve performance?
Denormalization is generally used to either:
Avoid a certain number of queries
Remove some joins
The basic idea of denormalization is that you'll add redundant data, or group some, to be able to get that data more easily, at a smaller cost, which is better for performance.
A quick example?
Consider a "Posts" and a "Comments" table, for a blog
For each Post, you'll have several lines in the "Comments" table
This means that to display a list of posts with the associated number of comments, you'll have to:
Do one query to list the posts
Do one query per post to count how many comments it has (Yes, those can be merged into only one, to get the number for all posts at once)
Which means several queries.
Now, if you add a "number of comments" field into the Posts table:
You only need one query to list the posts
And no need to query the Comments table: the number of comments is already denormalized into the Posts table.
And one query that returns one extra field is better than several queries.
Now, there are some costs, yes:
First, this costs some space, both on disk and in memory, as you have some redundant information:
The number of comments is stored in the Posts table
And the same number can be found by counting rows in the Comments table
Second, each time someone adds/removes a comment, you have to:
Save/delete the comment, of course
But also, update the corresponding number in the Posts table.
But, if your blog has a lot more people reading than writing comments, this is probably not so bad.
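A hedged sketch of that write path with plain JDBC (table and column names are assumptions): the comment insert and the counter update happen in one transaction, so the denormalized count never drifts from the real one.

    import java.sql.*;

    class CommentService {
        void addComment(Connection con, long postId, String body) throws SQLException {
            boolean oldAutoCommit = con.getAutoCommit();
            con.setAutoCommit(false);
            try (PreparedStatement insert = con.prepareStatement(
                     "INSERT INTO Comments (post_id, body) VALUES (?, ?)");
                 PreparedStatement bump = con.prepareStatement(
                     "UPDATE Posts SET comment_count = comment_count + 1 WHERE id = ?")) {
                insert.setLong(1, postId);
                insert.setString(2, body);
                insert.executeUpdate();
                bump.setLong(1, postId);
                bump.executeUpdate();
                con.commit();          // both writes succeed or neither does
            } catch (SQLException e) {
                con.rollback();
                throw e;
            } finally {
                con.setAutoCommit(oldAutoCommit);
            }
        }
    }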
Denormalization is a time-space trade-off. Normalized data takes less space, but may require joins to construct the desired result set, hence more time. If it's denormalized, data are replicated in several places. It then takes more space, but the desired view of the data is readily available.
There are other time-space optimizations, such as
denormalized view
precomputed columns
As with any such approach, this improves reads (because the data is readily available), but makes updates more costly (because you need to keep the replicated or precomputed data up to date).
The word "denormalizing" leads to confusion of the design issues. Trying to get a high performance database by denormalizing is like trying to get to your destination by driving away from New York. It doesn't tell you which way to go.
What you need is a good design discipline, one that produces a simple and sound design, even if that design sometimes conflicts with the rules of normalization.
One such design discipline is star schema. In a star schema, a single fact table serves as the hub of a star of tables. The other tables are called dimension tables, and they are at the rim of the schema. The dimensions are connected to the fact table by relationships that look like the spokes of a wheel. Star schema is basically a way of projecting multidimensional design onto an SQL implementation.
Closely related to star schema is snowflake schema, which is a little more complicated.
If you have a good star schema, you will be able to get a huge variety of combinations of your data with no more than a three way join, involving two dimensions and one fact table. Not only that, but many OLAP tools will be able to decipher your star design automatically, and give you point-and-click, drill down, and graphical analysis access to your data with no further programming.
Star schema design occasionally violates second and third normal forms, but it results in more speed and flexibility for reports and extracts. It's most often used in data warehouses, data marts, and reporting databases. You'll generally have much better results from star schema or some other retrieval oriented design, than from just haphazard "denormalization".
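To make the "three-way join" claim concrete, a typical star-schema query might look like this (a sketch only; the fact and dimension table names are assumptions, not from the answer above):

    class StarSchemaQuery {
        // One fact table (sales_fact) joined to two dimension tables; changing
        // the GROUP BY columns answers a different business question with the
        // same three-way join shape.
        static final String SALES_BY_CATEGORY_AND_REGION =
            "SELECT p.category, s.region, SUM(f.amount) AS total_sales "
          + "FROM sales_fact f "
          + "JOIN product_dim p ON p.id = f.product_id "
          + "JOIN store_dim s ON s.id = f.store_id "
          + "GROUP BY p.category, s.region";
    }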
The critical issues in denormalizing are:
Deciding what data to duplicate and why
Planning how to keep the data in synch
Refactoring the queries to use the denormalized fields.
One of the easiest types of denormalizing is to copy an identity field into other tables to avoid a join. As identities should not ever change, the issue of keeping the data in sync rarely comes up. For instance, we copy our client id to several tables because we often need to query them by client and do not necessarily need, in those queries, any of the data from the tables that would sit between the client table and the table we are querying if the data were fully normalized. You still have to do one join to get the client name, but that is better than joining to 6 parent tables to get the client name when that is the only piece of data you need from outside the table you are querying.
However, there would be no benefit to this unless we often ran queries that did not need any data from the intervening tables.
Another common denormalization might be to add a name field to other tables. As names are inherently changeable, you need to ensure that they stay in sync with triggers. But if this means joining to 2 tables instead of 5, it can be worth the cost of the slightly longer insert or update.
If you have certain requirement, like reporting etc., it can help to denormalize your database in various ways:
introduce certain data duplication to save yourself some JOINs (e.g. fill certain information into a table and accept the duplicated data, so that everything you need is in that table and doesn't have to be found by joining another table)
you can pre-compute certain values and store them in a table column, instead of computing them on the fly every time you query the database. Of course, those computed values might get "stale" over time and you might need to re-compute them at some point, but just reading out a fixed value is typically cheaper than computing something (e.g. counting child rows)
There are certainly more ways to denormalize a database schema to improve performance, but you just need to be aware that you do get yourself into a certain degree of trouble doing so. You need to carefully weigh the pros and cons - the performance benefits vs. the problems you get yourself into - when making those decisions.
Consider a database with a properly normalized parent-child relationship.
Let's say the cardinality is, on average, two child rows per parent row.
You have two tables: Parent, with p rows, and Child, with 2p rows.
The join operation means that for p parent rows, 2p child rows must be read. The total number of rows read is p + 2p = 3p.
Consider denormalizing this into a single table containing only the child rows, 2p of them. The number of rows read is 2p.
Fewer rows == less physical I/O == faster.
As per the last section of this article,
https://technet.microsoft.com/en-us/library/aa224786%28v=sql.80%29.aspx
one could use Virtual Denormalization, where you create Views with some denormalized data so that simpler SQL queries run faster, while the underlying tables remain normalized for faster add/update operations (so long as you can get away with updating the Views at regular intervals rather than in real time). I'm just taking a class on relational databases myself but, from what I've been reading, this approach seems logical to me.
Benefits of de-normalization over normalization
Basically, denormalization is used for a plain DBMS, not for an RDBMS. As we know, an RDBMS works with normalization, which means not repeating data again and again; yet some data is still repeated when you use foreign keys.
When you use a plain DBMS, you need to remove normalization, and for this you need repetition. But it can still improve performance, because there are no relations among the tables and each table has an indivisible existence.
Google App Engine offers the Google Datastore as its only NoSQL database (I think it is based on BigTable).
In my application I have a social-network-like data structure and I want to model it as I would in a graph database. My application must save heterogeneous objects (users, files, ...) and relationships among them (such as user1 OWNS file2, user2 FOLLOWS user3, and so on).
I'm looking for a good way to model this typical situation, and I thought to two families of solutions:
List-based solutions: Any object contains a list of other related objects, and the object's presence in the list is itself the relationship (as Google describes in the JDO documentation: https://developers.google.com/appengine/docs/java/datastore/jdo/relationships).
Graph-based solution: Both nodes and relationships are objects. The objects exist independently from the relationships while each relationship contain a reference to the two (or more) connected objects.
What are strong and weak points of these two approaches?
About approach 1: This is the simplest approach one can think of, and it is also presented in the official documentation, but:
Each directed relationship makes the object record grow: are there any limitations on the number of possible relationships, given for instance the entity size limit?
Is that a JDO feature, or does the datastore structure itself allow that approach to be implemented naturally?
The relationship search time will increase with the size of the list; is this solution suitable for large numbers (millions) of relationships?
About approach 2: Each relationship can have a higher level of characterization (it is an object and it can have properties). And I think memory size is not a Google problem, but:
Each relationship requires its own record, so the search time for each related couple will increase as the total number of relationships increases. Is this suitable for large amounts of relationships (millions, billions)? I.e. does Google have good tricks to search among records if they are well structured? Or will I soon be in a situation where, if I want to find a friend of User1 called User4, I have to wait seconds?
On the other hand, each object doesn't grow in size as new relationships are added.
Could you help me find other important points about the two approaches, so I can choose the best model?
First, the search time in the Datastore does not depend on the number of entities that you store, only on the number of entities that you retrieve. Therefore, if you need to find one relationship object out of a billion, it will take the same time as if you had just one object.
Second, the list approach has a serious limitation called "exploding indexes". You will have to index the property that contains a list to make it searchable. If you ever use a query that references more than just this property, you will run into this issue - google it to understand the implications.
Third, the list approach is much more expensive. Every time you add a new relationship, you will rewrite the entire entity at considerable writing cost. The reading costs will be higher too if you cannot use keys-only queries. With the object approach you can use keys-only queries to find relationships, and such queries are now free.
UPDATE:
If your relationships are directed, you may consider making Relationship entities children of User entities, and using an Object id as an id for a Relationship entity as well. Then your Relationship entity will have no properties at all, which is probably the most cost-efficient solution. You will be able to retrieve all objects owned by a user using keys-only ancestor queries.
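A hedged sketch of that update using Objectify (class and field names are assumptions, and entities must be registered with ObjectifyService first): the relationship entity is a child of its user and reuses the owned object's id as its own id, so it carries no properties at all, and ownership can be listed with a keys-only ancestor query.

    import java.util.List;
    import com.googlecode.objectify.Key;
    import com.googlecode.objectify.annotation.Entity;
    import com.googlecode.objectify.annotation.Id;
    import com.googlecode.objectify.annotation.Parent;
    import static com.googlecode.objectify.ObjectifyService.ofy;

    @Entity
    class AppUser {
        @Id Long id;
    }

    @Entity
    class OwnsRelationship {
        @Parent Key<AppUser> owner;   // the owning user
        @Id Long ownedObjectId;       // id of the owned object, reused as the entity id
    }

    class RelationshipQueries {
        // Keys-only ancestor query: returns the ids of everything the user owns.
        static List<Key<OwnsRelationship>> ownedBy(Key<AppUser> user) {
            return ofy().load().type(OwnsRelationship.class).ancestor(user).keys().list();
        }
    }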
I have an AppEngine application and I use both approaches. Which is better depends on two things: the practical limits of how many relationships there can be and how often the relationships change.
NOTE 1: My answer is based on experience with Objectify and heavy use of caching. Mileage may vary with other approaches.
NOTE 2: I've used the term 'id' instead of the proper Datastore term 'name' here. Name would have been confusing, and id matches Objectify terms better.
Consider users linked to the schools they've attended and vice versa. In this case, you would do both. Link the users to schools with a variation of the 'List' method. Store the list of school ids the user attended as a UserSchoolLinks entity with a different type/kind but with the same id as the user. For example, if the user's id = '6h30n' store a UserSchoolLinks object with id '6h30n'. Load this single entity by key lookup any time you need to get the list of schools for a user.
However, do not do the reverse for the users that attended a school. For that relationship, insert a link entity. Use a combination of the school's id and the user's id for the id of the link entity. Store both id's in the entity as separate properties. For example, the SchoolUserLink for user '6h30n' attending school 'g3g0a3' gets id 'g3g0a3~6h30n' and contains the fields: school=g3g0a3 and user=6h30n. Use a query on the school property to get all the SchoolUserLinks for a school.
Here's why:
Users will see their schools frequently but change them rarely. Using this approach, the user's schools will be cached and won't have to be fetched every time they hit their profile.
Since you will be getting the user's schools via a key lookup, you won't be using a query. Therefore, you won't have to deal with eventual consistency for the user's schools.
Schools may have many users that attended them. By storing this relationship as link entities, we avoid creating a huge single object.
The users that attended a school will change a lot. This way we don't have to write a single, large entity frequently.
By using the id of the User entity as the id for the UserSchoolLinks entity we can fetch the links knowing just the id of the user.
By combining the school id and the user id as the id for the SchoolUserLink, we can do a key lookup to see whether a user and a school are linked. Once again, no need to worry about eventual consistency for that.
By including the user id as a property of the SchoolUserLink we don't need to parse the SchoolUserLink object to get the id of the user. We can also use this field to check consistency between both directions and have a fallback in case somehow people are attending hundreds of schools.
Downsides:
1. This approach violates the DRY principle. Seems like the least of evils here.
2. We still have to use a query to get the users who attended a school. That means dealing with eventual consistency.
Don't forget to update the UserSchoolLinks entity and add/remove the SchoolUserLink entity in a transaction.
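A hedged Objectify sketch of the two shapes described above (field names are assumptions; entities must be registered with ObjectifyService):

    import java.util.ArrayList;
    import java.util.List;
    import com.googlecode.objectify.annotation.Entity;
    import com.googlecode.objectify.annotation.Id;
    import com.googlecode.objectify.annotation.Index;
    import static com.googlecode.objectify.ObjectifyService.ofy;

    @Entity
    class UserSchoolLinks {                 // same id as the user, e.g. "6h30n"
        @Id String id;
        List<String> schoolIds = new ArrayList<String>();  // small list, read by key lookup
    }

    @Entity
    class SchoolUserLink {                  // id = schoolId + "~" + userId, e.g. "g3g0a3~6h30n"
        @Id String id;
        @Index String school;               // "g3g0a3" - indexed, queried to list a school's users
        String user;                        // "6h30n"  - plain property for convenience and consistency checks
    }

    class LinkLookups {
        // Key lookup: no query, no eventual consistency.
        static UserSchoolLinks schoolsOf(String userId) {
            return ofy().load().type(UserSchoolLinks.class).id(userId).now();
        }
        // Query on the indexed school property: eventually consistent.
        static List<SchoolUserLink> attendeesOf(String schoolId) {
            return ofy().load().type(SchoolUserLink.class).filter("school", schoolId).list();
        }
    }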
Your question is quite complex, but I will try to explain the best solution (I will answer in Python, but the same can be done in Java).
from google.appengine.ext import db

class User(db.Model):
    followers = db.StringListProperty()
Adding a follower is simple:
user = User.get(key)
user.followers.append(str(followerKey))
user.put()  # persist the change
This allows fast queries in both directions: a user's followers are right there in the list, and who a user follows can be found with a query:
User.all().filter('followers =', followerKey)  # -> users that followerKey follows
This query is costly in read I/O, so you can make it faster, but that is more complicated and more costly in write I/O:
class User(db.Model):
    followers = db.StringListProperty()
    follows = db.StringListProperty()
However, this is more complicated to keep up to date: deleting a user (or removing a follow) means the follows lists must be updated as well, so you need two writes.
You can also store relationships as their own entities, but that is the worst scenario here, since it is even more complex than the second example with followers and follows. Keep in mind that an entity can hold about 1 MB; that is not usually a limit you hit, but it can be.
I have a database with 3 tables. The main table is Contract, and it is joined with pairs of keys from two tables: Languages and Regions.
Each pair is unique, but it is possible that one contract will have the following pair ids:
{ (1,1), (1,2), (2,1), (2,2) }
Today, the three tables are linked via a connecting entity called ContractLanguages. It contains a sequence id, and triplets of ids from the three tables.
However, in large enough contracts this causes a serious performance issue, as the Hibernate environment creates a staggering number of objects.
Therefore, we would like to remove this connecting entity, so that Contract will hold some collection of these pairs.
Our proposed solution: create an @Embeddable class containing the Language and Region ids, and store a collection of them in the Contract entity.
The idea behind this is that there is a relatively small number of languages and regions.
We are assuming that Hibernate manages a list of such pairs and does not create duplicates, thereby substantially reducing the number of objects created.
However, we have the following questions:
Will this solution work? Will Hibernate know to create the correct object?
Assuming the solution works (the link is created correctly), will Hibernate optimize the object creation so that it stops creating duplicate objects?
If this solution does not work, how do we solve the problem mentioned above without a connecting entity?
From your post and comments I assume the following situation, please correct me if I'm wrong:
You have a limited set of Languages + Regions combinations (currently modelled as ContractLanguages entities)
You have a huge amount of Contract entities
Each contract can reference multiple Languages and Regions
You have problems loading all the contract languages because currently the combination consists of contract + language + region
Based on those assumptions, several possible optimizations come to my mind:
You could create a LanguageRegion entity which has a unique id and each contract references a set of those. That way you'd get one more table but Hibernate would just create one entity per LanguageRegion and load it once per session, even if multiple contracts would reference it. For that to work correctly you should employ lazy loading and maybe load those LanguageRegion entities into the first level cache before loading the contracts.
Alternatively, you could load just the columns that are needed, i.e. just parts of an entity. You'd employ lazy loading as well, but wouldn't access the contract languages directly; instead you'd load them in a separate query, e.g. (names are guessed):
SELECT c.id, lang.id, lang.name, region.id, region.name FROM Contract c
JOIN c.contractLanguages cl
JOIN cl.language lang
JOIN cl.region region
WHERE c.id in (:contractIds)
Then you load the contracts, get their ids, and load the language and region details using that query (it returns a List<Object[]>, with each object array containing the column values as selected). You put those into an appropriate data structure and access them as needed. That way you bypass entity creation and just get the data that is needed.
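A minimal sketch of the first suggestion (entity, table, and column names are assumptions): a shared LanguageRegion entity with its own id, referenced from Contract through a join table, so Hibernate materializes each combination once per session instead of one ContractLanguages object per triplet.

    import java.util.HashSet;
    import java.util.Set;
    import javax.persistence.*;

    @Entity
    class LanguageRegion {
        @Id Long id;
        Long languageId;
        Long regionId;
    }

    @Entity
    class Contract {
        @Id Long id;

        // Lazily loaded; the same LanguageRegion instance is shared by every
        // contract that references it within a session.
        @ManyToMany(fetch = FetchType.LAZY)
        @JoinTable(name = "contract_language_region")
        Set<LanguageRegion> languageRegions = new HashSet<>();
    }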
I'm fairly new to the App Engine datastore, but I get that it is designed more like a Hashtable than a database table. This leads me to think it's better to have fewer rows (entities) and more columns (object properties) "in general".
That is, you can create a Car object with properties color and count, or you can create it with properties redCount, blueCount, greenCount, assuming you know all the colors (dimensions). If you are storing instances of those objects, you would have either three entities or one:
For each color and count, save new entity:
"red", 3
"blue", 8
"green", 4
Or save one entity with properties for each possible color: 3, 8, 4
Obviously there are some design challenges with the latter approach, but I'm wondering what the advice is for getting out of relational thinking. It seems the datastore is quite happy with hundreds of "columns"/properties.
Good job trying to get out of relational thinking. It's good to move away from the row/table thinking.
A closer approximation, at least on the programming side, would be to think of entities as data structure or class instances stored remotely. These entities have properties. Separate from the entities are indexes, which essentially store lists of entities that match certain criteria for properties.
When you write an entity, the datastore updates that instance in memory/storage, and then updates all the indexes.
When you do a query, you essentially walk through one of the index lists.
That should give you a basic framework to think about the datastore.
When you design for the datastore, you generally have to design for cost, and to a lesser degree, performance. On the write side, you want to minimize the number of indexes. On the read side, you want to minimize the number of entities you're reading, so the idea of having separate entities for red, blue, green could be a bad idea, tripling your read costs if you constantly need to read back the number of red/blue/green cars. There could be some really obscure corner case where this makes sense.
Your design considerations generally should go along the lines of:
What types of queries do I need to do?
How do I structure my data to make these queries easy to do (since the GAE query capabilities are limited)? Would a query be easier if I duplicate data somehow, and would I be able to maintain this duplicated data on my own?
How can I minimize the number of indexes that need to be updated when I update an entity?
Are there any special cases where I must have full consistency and therefore need to adjust the structure so that consistent queries can be made?
Are there any write performance cases I need to be careful about?
Without knowing exactly what kind of query you're going to make, this answer will likely not be right, but it should illustrate how you might want to think of this.
I'll assume you have an application where people register their cars, and a dashboard that polls the datastore and displays the number of cars of each color. In that case, the traditional mechanism of having a Car class with color and count attributes still makes sense, because it minimizes the number of indexed properties and thus reduces your write costs.
It's a bit of an odd example, because I can't tell if you want to just have a single entity that keeps track of your counts (in which case you don't even need to do a query, you can just fetch that count), or if you have a number of entities of counts that you may fetch and sum up.
If user updates modify the same entity, though, you might run into performance problems; you should read through this: https://developers.google.com/appengine/articles/sharding_counters
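If you do go with count-tracking entities fetched directly (the second possibility mentioned above), a minimal Objectify-style sketch might look like this (names are assumptions; if these counters are written frequently, shard them as the linked article describes):

    import com.googlecode.objectify.annotation.Entity;
    import com.googlecode.objectify.annotation.Id;
    import static com.googlecode.objectify.ObjectifyService.ofy;

    @Entity
    class ColorCount {
        @Id String color;   // "red", "blue", "green" - the color doubles as the key
        long count;         // left unindexed (Objectify does not index fields unless asked)
    }

    class ColorCounts {
        // Reading a count is a key lookup, not a query.
        static long countFor(String color) {
            ColorCount c = ofy().load().type(ColorCount.class).id(color).now();
            return c == null ? 0L : c.count;
        }
    }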
I would recommend not storing things in your own custom format in a single cell, unless it is encoded in JSON or something similar.
{"red": 3, "blue": 4}
JSON is OK because it can easily be decoded into a data structure within Java, such as a map or list.
There is nothing wrong with lots of columns in an app. You will get more gains by having a column each for red, blue, and green. There would have to be a very large number of columns to see a big slowdown.
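For example, such a JSON value could be decoded with a library like Gson (a hedged sketch; the stored format is assumed to be the map shown above):

    import java.lang.reflect.Type;
    import java.util.Map;
    import com.google.gson.Gson;
    import com.google.gson.reflect.TypeToken;

    class ColorCountsJson {
        // Turns a stored string like {"red":3,"blue":4} into a Map.
        static Map<String, Integer> parse(String json) {
            Type type = new TypeToken<Map<String, Integer>>() {}.getType();
            return new Gson().fromJson(json, type);
        }
    }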
I think it is safe to say that there is no significant performance penalty for having a lot of properties (columns) for each entity (row) in a database model. Nor is there a penalty for lots of rows (entities), or even lots of tables (db classes). If I were doing your example, I would definitely set up separate properties for color and count. We always explicitly call out indexed=False/True to ensure we avoid the dreaded problem of wondering why your indexes are so large when you only have a few properties indexed (forgetting that the default is True). Although GAE gives you nice properties such as lists that can be indexed, these specialized properties are not without their overhead costs. Understand them well whenever you use them.
One thing that I think is important to remember with GAE when plotting your design is that standard queries are slow, and slow equates to increased latency, and increased latency results in more instances and more expense (and other frustrations). Before defaulting to a standard query, always ask (if this is a mission-critical part of your code) whether you can accomplish the same thing with a more denormalized data structure. For example, linking a set of entities together using a common key and then doing a series of get_by_id() calls can often be advantageous (be sure to manage ndb's auto memcache when doing this - not everything needs to be cached). Ancestor queries are also much faster than standard queries (but impose a limit of 1 update per second on the entity group).
Concluding: within reason, the number of properties (columns) in an entity (row) and the total number of classes (tables) will not impose any real issues. However, if you are coming from a standard relational DB background, your inclination will be to use SQL-like queries to move your logic along. Remember that in GAE standard GQL queries are slow and costly, and always think about using denormalization to avoid them. GAE is a big, flat, highly performant NoSQL-like resource. Use it as such. Take the extra time to avoid reliance on GQL queries; it will be worth it.
What is the convention for this? Say for example I have the following, where an item bid can only be a bid on one item:
public class Item {
    @OneToMany(mappedBy = "item")
    Set<ItemBid> itemBids = new HashSet<ItemBid>();
}
If I am given the name of the item bidder (which is stored in ItemBid), should I A) load the Item using an Item DAO and iterate over the collection of its itemBids until I find the one with the name I want, or B) create an ItemBid DAO where the item and bidder name are used in criteria or HQL?
I would presume that B) would be the most efficient with very large collections, so would this be standard for retrieving very specific items from large collections? If so, could I have a general guideline as to when I should use the collections and when I should use DAOs/Criteria?
Yes, you should definitely query bids directly. Here are the guidelines:
If you are searching for a specific bid, use query
If you need a subset of bids, use query
If you want to display all the bids for a given item - it depends. If the number of bids is reasonably small, fetch an item and use collection. Otherwise - query directly.
Of course, from an OO perspective you should always use a collection (preferably with findBy*() methods in Item accessing the bids collection internally), which is also more convenient. However, if the number of bids per item is significant, the cost of (even lazy) loading will be significant and you will soon run out of memory. This approach is also very wasteful.
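A hedged sketch of the direct query for the single-bid case, assuming the Item/ItemBid mapping from the question (the bidderName property is an assumption):

    import org.hibernate.Session;

    class ItemBidDao {
        // Queries the bid directly instead of loading the item's whole collection.
        ItemBid findByItemAndBidder(Session session, Long itemId, String bidderName) {
            return (ItemBid) session.createQuery(
                    "from ItemBid b where b.item.id = :itemId and b.bidderName = :name")
                .setParameter("itemId", itemId)
                .setParameter("name", bidderName)
                .uniqueResult();
        }
    }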
You should be asking yourself this question much sooner: at the time you were doing the mapping. Mapping for ORM should be an intellectual exercise, not a matter of copying all the foreign keys onto attributes on both sides (if only because of YAGNI, but there are many other good reasons).
Chances are, the bid-item mapping would be better as unidirectional (then again, maybe not).
In many cases we find that certain entities are strongly associated with an almost fixed number of some other entities (they would probably be called "aggregates" in DDD parlance). For example, invoices and invoice items, or a person and a list of his hobbies, or a post and a set of tags for this post. We do not expect that the number of items on a given invoice will grow over time, nor will the number of tags. So they are all good places to map a @OneToMany. On the other hand, the number of invoices for each client will keep growing, so we would just map a unidirectional @ManyToOne from invoice to client, and query.
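A minimal sketch of that invoice/client case (entity names are assumptions): the growing side gets only the unidirectional @ManyToOne, and a client's invoices are fetched with a query when needed.

    import javax.persistence.Entity;
    import javax.persistence.Id;
    import javax.persistence.ManyToOne;

    @Entity
    class Invoice {
        @Id Long id;

        @ManyToOne
        Client client;   // no matching @OneToMany collection on Client
    }

    @Entity
    class Client {
        @Id Long id;
        String name;
    }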
Repositories (DAOs, whatever) that do queries are perfectly good OO (there is nothing wrong with a query; it is just an object describing your requirements in a storage-neutral way); using finders in entities is not. From a practical point of view, it binds your entities to the data access layer (DAOs or even JPA classes), and this will make them unusable in many use cases (GWT) or tricky to use when detached (you will have to guess which methods work outside a session). From a philosophical point of view, it violates the single responsibility principle and changes your JPA entities into a sort of active record wannabe.
So, my answer would be:
if you need a single bid, query directly,
if you want to display all the bids for a given item - fetch an item and use the collection. This does not depend on the number of bids per item, as the query performed by JPA will be identical as a query you might perform yourself. If this approach needs tuning (like in a case where you need to fetch a lot of items and want to avoid the "N + 1 selects problem") then there is plenty of ways (join fetch, eager fetching, hints) to make it right, without changing the part of the code that uses getBids().
The simplest way to think about it is: if you think that some collection will never be displayed with paging (like tags on a post, items on an invoice, hobbies of a person), map it with @OneToMany and access it as a collection.