I've got a design question regarding Google Cloud Datastore. Let me explain it using an example:
I've got Entities of the kind "Article" with the following properties:
title
userId
....
sumOfScore
SumOfScore should be the sum of all related "Score" entities, which have
properties like:
articleId
userId
score
In Pseudo-SQL:
sumOfScore = select sum(score) from Score where score.articleId = article.id
I see two possibilities for designing this (using Google's Datastore API):
1.) No sumOfScore property for Articles; query every time instead:
This means: every time an article is read, I need to run a query for that specific article to calculate its sumOfScore.
Imagine a list of 100 articles shown to a user. That would need 100 additional queries to the database, just to show the score for each article.
Nevertheless, this would be my preferred way with a relational DB: no redundancy and good normalization.
And with SQL you can fetch all the data with a single join-select.
But it doesn't feel right for Cloud Datastore.
2.) Calculate the sumOfScore whenever Score entities are changed:
This means: Whenever a Score-Entity is added, removed or changed, the related Article
updates the sumOfScore property.
Advantage: When reading articles no additional queries are needed. The sumOfScore is redundant on the entity itself.
Disadvantage: Every time a score is changed, there is one additional query and an additional write (updating the Article entity). And the sumOfScore may get out of sync with the actual Score entities (e.g. if a value is changed via the DB console).
What do more experienced people think? Is there a common best practice for such a scenario?
What do the JPA or JDO implementations do under the hood?
Thanks a lot
Mos
The first thing I recommend is that you look into the GAE article about sharding counters.
That is an article from the GAE best practices relating to how you should be handling counters/sums. It can be a little tricky because every time you update an element you have to use logic to randomly pick a sharded counter; and when you retrieve your count you're actually fetching a group of entities and summing them. I've gone this route but won't provide code here on how I did it because I haven't battle tested it yet. But your code can get sloppy in a hurry if you just copy/paste the sample sharding code all over the place, so make an abstract or typed counter class to reuse your sharding logic if you decide to go this route.
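For illustration only (this is not the answer author's code), here is a minimal sketch of the idea from the sharding article, written against the legacy Python ndb API; the kind name ScoreShard and the shard count are made up:

import random
from google.appengine.ext import ndb

NUM_SHARDS = 20  # more shards = more write throughput, but more entities to sum on read

class ScoreShard(ndb.Model):
    # key id = "<article_id>-<shard_number>"
    count = ndb.IntegerProperty(default=0)

@ndb.transactional
def increment(article_id, delta=1):
    # pick one shard at random so concurrent writers rarely touch the same entity
    index = random.randint(0, NUM_SHARDS - 1)
    key = ndb.Key(ScoreShard, '%s-%d' % (article_id, index))
    shard = key.get() or ScoreShard(key=key)
    shard.count += delta
    shard.put()

def total(article_id):
    # reading the total means fetching all shards and summing them
    keys = [ndb.Key(ScoreShard, '%s-%d' % (article_id, i)) for i in range(NUM_SHARDS)]
    return sum(s.count for s in ndb.get_multi(keys) if s)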
Another alternative would be to use a fuzzy count. This method uses memcache and offers better performance at the cost of accuracy.
See the section here labeled "Transient and frequently updated data"
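As a rough illustration of that section's idea (not code from the linked article), a "fuzzy" counter can be buffered in memcache and flushed to the datastore occasionally; the ArticleScore kind and the key format below are made up, and counts can be lost if memcache evicts the key:

from google.appengine.api import memcache
from google.appengine.ext import ndb

class ArticleScore(ndb.Model):
    # key id = article id; holds the durable (but slightly stale) total
    total = ndb.IntegerProperty(default=0)

def add_score(article_id, value):
    # atomic and very cheap; assumes positive deltas (memcache.incr rejects negative ones)
    memcache.incr('score-buffer-%s' % article_id, delta=value, initial_value=0)

def flush(article_id):
    # run occasionally (cron or task queue): move the buffered delta into the datastore
    cache_key = 'score-buffer-%s' % article_id
    buffered = memcache.get(cache_key) or 0
    if buffered:
        entity = ArticleScore.get_or_insert(str(article_id))
        entity.total += buffered
        entity.put()
        memcache.decr(cache_key, delta=buffered)  # keep any increments that arrived meanwhile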
And the last alternative is to just use SQL. It's experimental and hot out of the oven (in relation to being used on GAE), but it might be worth looking into.
There's a third possibility which doesn't make a compromise.
You make Score a child of Article, and keep the sumOfScore in Article. For sorting purposes, this field will come in handy. As these two kinds are in the same entity group, you can create a Score and update the Article in a single transaction. You could even double-check by querying all the Score entities whose parent is a given Article.
The problem with this approach is that you can only update an entity 5 times per second. If you think you'll have much more activity than that (remember, it's just a limitation on a single entity, not the entire kind), you should check out the sharded counter tutorial or the Google I/O video explaining this.
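A minimal sketch of this parent/child approach, using the legacy Python ndb API (an illustration of the idea, not the answerer's code); class and property names follow the question:

from google.appengine.ext import ndb

class Article(ndb.Model):
    title = ndb.StringProperty()
    sumOfScore = ndb.IntegerProperty(default=0)

class Score(ndb.Model):          # always created with an Article as its parent
    userId = ndb.StringProperty()
    score = ndb.IntegerProperty()

@ndb.transactional               # single entity group, so a plain transaction is enough
def add_score(article_key, user_id, value):
    Score(parent=article_key, userId=user_id, score=value).put()
    article = article_key.get()
    article.sumOfScore += value  # keep the denormalized total in step with the Scores
    article.put()

def recomputed_total(article_key):
    # the double check mentioned above: sum the child Scores with an ancestor query
    return sum(s.score for s in Score.query(ancestor=article_key))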
edit:
Here's a great discussion about this same topic: How does Google Moderator avoid contention?
I've heard a lot about denormalization, which is done to improve the performance of certain applications, but I've never tried anything related to it.
So I'm just curious: which places in a normalized DB make performance worse, or, in other words, what are the principles of denormalization?
How can I use this technique if I need to improve performance?
Denormalization is generally used to either:
Avoid a certain number of queries
Remove some joins
The basic idea of denormalization is that you add redundant data, or group some together, to be able to get that data more easily, at a smaller cost, which is better for performance.
A quick example?
Consider a "Posts" and a "Comments" table, for a blog
For each Post, you'll have several lines in the "Comment" table
This means that to display a list of posts with the associated number of comments, you'll have to:
Do one query to list the posts
Do one query per post to count how many comments it has (Yes, those can be merged into only one, to get the number for all posts at once)
Which means several queries.
Now, if you add a "number of comments" field into the Posts table:
You only need one query to list the posts
And no need to query the Comments table: the number of comments is already denormalized into the Posts table.
And one query that returns one extra field is better than several queries.
Now, there are some costs, yes:
First, this costs some space, both on disk and in memory, as you have some redundant information:
The number of comments is stored in the Posts table
And that number can also be found by counting rows in the Comments table
Second, each time someone adds/removes a comment, you have to:
Save/delete the comment, of course
But also, update the corresponding number in the Posts table.
But, if your blog has a lot more people reading than writing comments, this is probably not so bad.
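To make the Posts/Comments example concrete, here is a small runnable sketch using Python's sqlite3; table and column names are illustrative:

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE posts    (id INTEGER PRIMARY KEY, title TEXT,
                           comment_count INTEGER NOT NULL DEFAULT 0);   -- denormalized
    CREATE TABLE comments (id INTEGER PRIMARY KEY,
                           post_id INTEGER REFERENCES posts(id),
                           body TEXT);
""")

def add_comment(post_id, body):
    with db:  # one transaction: save the comment and keep the counter in step
        db.execute("INSERT INTO comments (post_id, body) VALUES (?, ?)", (post_id, body))
        db.execute("UPDATE posts SET comment_count = comment_count + 1 WHERE id = ?",
                   (post_id,))

db.execute("INSERT INTO posts (title) VALUES ('Hello')")
add_comment(1, 'First!')

# Listing posts with their comment counts is now a single, join-free query:
print(db.execute("SELECT title, comment_count FROM posts").fetchall())  # [('Hello', 1)]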
Denormalization is a time-space trade-off. Normalized data takes less space, but may require a join to construct the desired result set, hence more time. If it's denormalized, data is replicated in several places. It then takes more space, but the desired view of the data is readily available.
There are other time-space optimizations, such as
denormalized view
precomputed columns
As with any of such approach, this improves reading data (because they are readily available), but updating data becomes more costly (because you need to update the replicated or precomputed data).
The word "denormalizing" leads to confusion of the design issues. Trying to get a high performance database by denormalizing is like trying to get to your destination by driving away from New York. It doesn't tell you which way to go.
What you need is a good design discipline, one that produces a simple and sound design, even if that design sometimes conflicts with the rules of normalization.
One such design discipline is star schema. In a star schema, a single fact table serves as the hub of a star of tables. The other tables are called dimension tables, and they are at the rim of the schema. The dimensions are connected to the fact table by relationships that look like the spokes of a wheel. Star schema is basically a way of projecting multidimensional design onto an SQL implementation.
Closely related to star schema is snowflake schema, which is a little more complicated.
If you have a good star schema, you will be able to get a huge variety of combinations of your data with no more than a three way join, involving two dimensions and one fact table. Not only that, but many OLAP tools will be able to decipher your star design automatically, and give you point-and-click, drill down, and graphical analysis access to your data with no further programming.
Star schema design occasionally violates second and third normal forms, but it results in more speed and flexibility for reports and extracts. It's most often used in data warehouses, data marts, and reporting databases. You'll generally have much better results from star schema or some other retrieval oriented design, than from just haphazard "denormalization".
The critical issues in denormalizing are:
Deciding what data to duplicate and why
Planning how to keep the data in synch
Refactoring the queries to use the denormalized fields.
One of the easiest types of denormalization is to copy an identity field into other tables to avoid a join. As identities should never change, the issue of keeping that data in sync rarely comes up. For instance, we copy our client id into several tables because we often need to query them by client and do not necessarily need, in those queries, any of the data from the tables that would sit between the client table and the table we are querying if the data were fully normalized. You still have to do one join to get the client name, but that is better than joining to 6 parent tables to get the client name when that is the only piece of data you need from outside the table you are querying.
However, there is no benefit to this unless we often run queries that don't need data from the intervening tables anyway.
Another common denormalization might be to add a name field to other tables. As names are inherently changeable, you need to ensure that the names stay in sync, for example with triggers. But if this means joining 2 tables instead of 5, it can be worth the cost of the slightly longer insert or update.
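A small sqlite3 sketch of keeping such a denormalized name column in sync with a trigger; the clients/orders tables are illustrative, not from the answer:

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders  (id INTEGER PRIMARY KEY,
                          client_id INTEGER REFERENCES clients(id),
                          client_name TEXT);                   -- denormalized copy

    -- When a client is renamed, push the new name into the redundant column.
    CREATE TRIGGER sync_client_name AFTER UPDATE OF name ON clients
    BEGIN
        UPDATE orders SET client_name = NEW.name WHERE client_id = NEW.id;
    END;
""")

db.execute("INSERT INTO clients (name) VALUES ('Acme')")
db.execute("INSERT INTO orders (client_id, client_name) VALUES (1, 'Acme')")
db.execute("UPDATE clients SET name = 'Acme Corp' WHERE id = 1")
print(db.execute("SELECT client_name FROM orders").fetchall())  # [('Acme Corp',)]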
If you have certain requirements, like reporting etc., it can help to denormalize your database in various ways:
introduce some data duplication to save yourself some JOINs (e.g. copy certain information into a table and accept the duplicated data, so that all the data is in that table and doesn't need to be found by joining another table)
you can pre-compute certain values and store them in a table column, instead of computing them on the fly every time you query the database. Of course, those computed values might get "stale" over time and you might need to re-compute them at some point, but just reading out a stored value is typically cheaper than computing something (e.g. counting child rows)
There are certainly more ways to denormalize a database schema to improve performance, but you just need to be aware that you do get yourself into a certain degree of trouble doing so. You need to carefully weigh the pros and cons - the performance benefits vs. the problems you get yourself into - when making those decisions.
Consider a database with a properly normalized parent-child relationship.
Let's say the cardinality is 2:1 on average, i.e. two child rows per parent.
You have two tables: Parent with p rows, and Child with 2p rows.
The join operation means that for p parent rows, 2p child rows must be read. The total number of rows read is p + 2p = 3p.
Consider denormalizing this into a single table with only the child rows, 2p of them. The number of rows read is then 2p.
Fewer rows == less physical I/O == faster.
As per the last section of this article,
https://technet.microsoft.com/en-us/library/aa224786%28v=sql.80%29.aspx
one could use virtual denormalization, where you create views with some denormalized data for running simpler SQL queries faster, while the underlying tables remain normalized for faster add/update operations (as long as you can get away with updating the views at regular intervals rather than in real time). I'm just taking a class on relational databases myself but, from what I've been reading, this approach seems logical to me.
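As a rough illustration of this "virtual denormalization" idea in Python/sqlite3 (SQLite has no indexed/materialized views, so this plain view is recomputed on every read, but the pre-joined shape is the same):

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE posts    (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER, body TEXT);

    -- The base tables stay normalized; the view exposes the denormalized shape.
    CREATE VIEW post_summary AS
        SELECT p.id, p.title, COUNT(c.id) AS comment_count
        FROM posts p LEFT JOIN comments c ON c.post_id = p.id
        GROUP BY p.id, p.title;
""")

db.execute("INSERT INTO posts (title) VALUES ('Hello')")
db.execute("INSERT INTO comments (post_id, body) VALUES (1, 'First!')")
print(db.execute("SELECT * FROM post_summary").fetchall())  # [(1, 'Hello', 1)]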
Benefits of de-normalization over normalization
Basically, de-normalization is used for a DBMS, not for an RDBMS. As we know, an RDBMS works with normalization, which means not repeating data again and again (though some data is still repeated when you use foreign keys).
When you use a plain DBMS, there is a need to remove normalization, and for this there is a need for repetition. But it can still improve performance, because there is no relation among the tables and each table has an independent existence.
Google App Engine offers the Google Datastore as its only NoSQL database (I think it is based on Bigtable).
In my application I have a social-like data structure and I want to model it as I would in a graph database. My application must save heterogeneous objects (users, files, ...) and relationships among them (such as user1 OWNS file2, user2 FOLLOWS user3, and so on).
I'm looking for a good way to model this typical situation, and I have thought of two families of solutions:
List-based solutions: Each object contains a list of related objects, and an object's presence in the list is itself the relationship (as Google describes in the JDO part https://developers.google.com/appengine/docs/java/datastore/jdo/relationships).
Graph-based solutions: Both nodes and relationships are objects. The objects exist independently of the relationships, while each relationship contains a reference to the two (or more) connected objects.
What are the strong and weak points of these two approaches?
About approach 1: This is the simplest approach one can think of, and it is also presented in the official documentation, but:
Each directed relationship makes the object record grow: are there any limits on the number of possible relationships, imposed for instance by the entity size limit?
Is that a JDO feature, or does the datastore structure itself also allow this approach to be implemented naturally?
The relationship search time will increase with the size of the list; is this solution suitable for large numbers (millions) of relationships?
About approach 2: Each relationship can have a richer characterization (it is an object and can have properties). And I think storage size is not a problem for Google, but:
Each relationship requires its own record, so the search time for each related pair will increase as the total number of relationships increases. Is this suitable for large numbers of relationships (millions, billions)? I.e. does Google have good tricks for searching among records if they are well structured? Or will I soon be in a situation where, if I want to find a friend of User1 called User4, I have to wait seconds?
On the other hand, each object doesn't grow in size as new relationships are added.
Could you help me find other important points about the two approaches, so that I can choose the best model?
First, the search time in the Datastore does not depend on the number of entities that you store, only on the number of entities that you retrieve. Therefore, if you need to find one relationship object out of a billion, it will take the same time as if you had just one object.
Second, the list approach has a serious limitation called "exploding indexes". You will have to index the property that contains a list to make it searchable. If you ever use a query that references more than just this property, you will run into this issue - google it to understand the implications.
Third, the list approach is much more expensive. Every time you add a new relationship, you will rewrite the entire entity at considerable writing cost. The reading costs will be higher too if you cannot use keys-only queries. With the object approach you can use keys-only queries to find relationships, and such queries are now free.
UPDATE:
If your relationships are directed, you may consider making Relationship entities children of User entities, and using an Object id as an id for a Relationship entity as well. Then your Relationship entity will have no properties at all, which is probably the most cost-efficient solution. You will be able to retrieve all objects owned by a user using keys-only ancestor queries.
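A minimal sketch of that update with the legacy Python ndb API (the kind name Owns is made up; the point is that the relationship entity needs no properties because everything is encoded in its key):

from google.appengine.ext import ndb

class User(ndb.Model):
    pass

class Owns(ndb.Model):   # illustrative relationship kind
    pass                 # no properties: parent = the User, id = the owned object's id

def add_owns(user_key, object_id):
    Owns(parent=user_key, id=object_id).put()

def owned_object_ids(user_key):
    # keys-only ancestor query: strongly consistent and cheap to read
    keys = Owns.query(ancestor=user_key).fetch(keys_only=True)
    return [k.id() for k in keys]

def owns(user_key, object_id):
    # direct key lookup: no query, no eventual consistency
    return ndb.Key(Owns, object_id, parent=user_key).get() is not None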
I have an AppEngine application and I use both approaches. Which is better depends on two things: the practical limits of how many relationships there can be and how often the relationships change.
NOTE 1: My answer is based on experience with Objectify and heavy use of caching. Mileage may vary with other approaches.
NOTE 2: I've used the term 'id' instead of the proper DataStore term 'name' here. Name would have been confusing and id matches objectify terms better.
Consider users linked to the schools they've attended and vice versa. In this case, you would do both. Link the users to schools with a variation of the 'List' method. Store the list of school ids the user attended as a UserSchoolLinks entity with a different type/kind but with the same id as the user. For example, if the user's id = '6h30n' store a UserSchoolLinks object with id '6h30n'. Load this single entity by key lookup any time you need to get the list of schools for a user.
However, do not do the reverse for the users that attended a school. For that relationship, insert a link entity. Use a combination of the school's id and the user's id for the id of the link entity. Store both id's in the entity as separate properties. For example, the SchoolUserLink for user '6h30n' attending school 'g3g0a3' gets id 'g3g0a3~6h30n' and contains the fields: school=g3g0a3 and user=6h30n. Use a query on the school property to get all the SchoolUserLinks for a school.
Here's why:
Users will see their schools frequently but change them rarely. Using this approach, the user's schools will be cached and won't have to be fetched every time they hit their profile.
Since you will be getting the user's schools via a key lookup, you won't be using a query. Therefore, you won't have to deal with eventual consistency for the user's schools.
Schools may have many users that attended them. By storing this relationship as link entities, we avoid creating a huge single object.
The users that attended a school will change a lot. This way we don't have to write a single, large entity frequently.
By using the id of the User entity as the id for the UserSchoolLinks entity we can fetch the links knowing just the id of the user.
By combining the school id and the user id into the id of the SchoolUserLink, we can do a key lookup to see whether a user and a school are linked. Once again, no need to worry about eventual consistency for that.
By including the user id as a property of the SchoolUserLink we don't need to parse the SchoolUserLink object to get the id of the user. We can also use this field to check consistency between both directions and have a fallback in case somehow people are attending hundreds of schools.
Downsides:
1. This approach violates the DRY principle. Seems like the least of evils here.
2. We still have to use a query to get the users who attended a school. That means dealing with eventual consistency.
Don't forget to update the UserSchoolLinks entity and add/remove the SchoolUserLink entity in a transaction.
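The answer above is based on Objectify (Java); just to make the id scheme concrete, here is a rough equivalent sketched with the legacy Python ndb API:

from google.appengine.ext import ndb

class UserSchoolLinks(ndb.Model):
    # same string id as the User entity, e.g. '6h30n'
    school_ids = ndb.StringProperty(repeated=True)

class SchoolUserLink(ndb.Model):
    # id = '<school_id>~<user_id>', e.g. 'g3g0a3~6h30n'
    school = ndb.StringProperty()
    user = ndb.StringProperty()

def link(user_id, school_id):
    @ndb.transactional(xg=True)  # one transaction, as recommended; two entity groups
    def txn():
        key = ndb.Key(UserSchoolLinks, user_id)
        links = key.get() or UserSchoolLinks(key=key)
        if school_id not in links.school_ids:
            links.school_ids.append(school_id)
            links.put()
        SchoolUserLink(id='%s~%s' % (school_id, user_id),
                       school=school_id, user=user_id).put()
    txn()

def schools_for_user(user_id):
    links = ndb.Key(UserSchoolLinks, user_id).get()   # key lookup, no query
    return links.school_ids if links else []

def users_for_school(school_id):
    # eventually consistent query on the 'school' property
    return [l.user for l in SchoolUserLink.query(SchoolUserLink.school == school_id)]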
Your question is quite complex, but I'll try to explain what I consider the best solution (I will answer in Python, but the same can be done in Java).
from google.appengine.ext import db

class User(db.Model):
    followers = db.StringListProperty()
Adding a follower is simple:
user = User.get(key)
user.followers.append(str(followerKey))
user.put()  # don't forget to save the updated list
This allows fast queries for both a user's followers (just read the list) and the users someone follows:
User.all().filter('followers =', str(followerKey))  # -> users that followerKey follows
This query is I/O-costly, so you can make it faster, but it gets more complicated and more costly in I/O on writes:
class User(db.Model):
    followers = db.StringListProperty()
    follows = db.StringListProperty()
However, this is more complicated to keep up to date, since deleting a user also means updating the follows lists, so you need two writes.
You can also store relationships as their own entities, but that is the worst scenario, since it is more complex than the second example with followers and follows. Keep in mind that an entity can hold 1 MB; that is usually not a limit you will hit, but it can be.
I have been reading a lot on ways to do aggregate queries on the datastore (thru stackoverflow and elsewhere). The preponderance of answers is that it cannot be done in a pleasant way. But then those answers are dated, and the same people tend to also claim that you cannot do things such as order by on the datastore.
As it exists today, you actually can specify ORDER BY on the datastore. So I am wondering if aggregation is also possible.
Consider the scenario where I have five candidates Alpha, Bravo, Charlie, Delta and Echo, and 10,000 voters. I want to retrieve the candidates and the number of votes each received, in order. How would I do that on the datastore? I am using Java.
Also, as an aside, if the answer is still no and fanning-in is my best option: is fan-in thread safe? By fanning-in I mean keeping an explicit counter that counts the vote each candidate receives (in a separate table). Could I experience a race condition or some other faults in the data when multiple users are voting concurrently?
If by aggregating you mean having the datastore compute the total # of votes for you, then no, the datastore won't do that.
The best way to do what you're describing is:
Create a set of sharded counters per candidate (google search for app engine sharded counters).
When someone votes, update the sharded counter for the given delegate.
When you want to read the votes, query for your delegates, then for each delegate, query for the sharded counters and sum them up.
Use memcache for better performance; the GAE sharded counters example available in the docs shows this pretty well.
Aggregation queries have recently launched and are available for use now: https://cloud.google.com/datastore/docs/aggregation-queries.
Various client libraries also support this particular feature.
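A sketch of what that could look like for the vote-counting scenario, assuming a recent version of the google-cloud-datastore Python client (the question mentions Java, where a similar API exists); the Vote kind and candidate property are made up:

from google.cloud import datastore

client = datastore.Client()

def votes_for(candidate):
    query = client.query(kind="Vote")
    query.add_filter("candidate", "=", candidate)

    agg = client.aggregation_query(query)
    agg.count(alias="total")          # newer client versions also offer sum() and avg()

    for result_group in agg.fetch():  # each group is a list of AggregationResult objects
        for result in result_group:
            return result.value
    return 0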
Hello
I'm developing a webapp, and while designing the database I came across this question.
Is it bad design to have more than one link between two tables?
The picture I have posted is a very quick and small example just to make it clearer.
If I want to display all the offers, I would also like to include the products they are related to. In this case I could retrieve the product name by creating a product instance fetched via the product id stored in the offer object, but that would require more query executions and more typing, so I was thinking of including the product name directly in the offer, so that I can simply retrieve all the offers and, if needed, display the related product by looking it up in the DB with its product id.
Would you consider this a bad approach?
I have been looking around for cases like mine, but I have only found approaches with one connection between tables (with unique ids).
Thank you
This is data denormalization. Don't do it (in most cases). Design the tables correctly, let the database do the correct work with the correct queries. It will be much easier to maintain and work with over time.
Use the ID in the offer table to lookup the product name in the products table.
Yes, this would be bad.
Removing the redundant name would be proper normalization. Just link on the id; that will be the best way.
In general there is no limit to the number of relationships (links) between two tables, but each relationship should have a unique meaning. If, in your example, Product Name and Product ID are both candidate keys and each name always has the same ID then you should definitely not have two PK/FK relationships between these tables.
@Joe is right. Normalization is the best approach to take with database design. The reason is that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships.
We develop and operate a blogging application in which user data is scattered across many tables:
- Blog
- Article
- Comment
- Message
- Trackback
- 50 other tables.
Users are able to close their account, and their account/contents must disappear from the site right away.
For legal/contractual reasons, we also must be able to undelete their account/content for a given duration, and also make that data available to judicial authorities for another duration.
Over the years and different applications, we used different approaches:
"deleted" flag everywhere : Each table has a "deleted" column, which is updated when data is deleted/restored. Very nasty because it slows down every list generation queries, creates a lot of updates upon deletion/restore. Also, it does not handle the two stage deletion described above. In fact we never used this one, but it's worth dis-advising it :)
"Multi table": For each table, we create a second table with the same schema plus two extra fields (dateDeleted, reason). The extra fields are used to know if the data is still accessible for restoration, when to delete it, and why/how it was deleted in the first place. This version is just a bit better than the previous version, but can be very nasty performance wise too when tables are growing. Also, you have to change the schema of some tables (ie: remove UNIQUE constraints) which makes the system harder to understand/upgrade for new developers, administrators ... and mentally healthy people in general.
"Multi DB": Same approach as before, but we move data on a different database cluster, which allows to browse those data without impacting the "end users" db. Also, for this app, the uniqueness constraint is done at the java level, so all the schemas are the same. Lastly, the double data retention constraint is done by having a dedicated DB for each constraint, which makes things easiers.
I have to admit that none of those approaches satisfies me, even if they can work up to a certain amount of data. I have also imagined that we could just delete some key rows in the DB and leave the rest inconsistent (scheduled for a more controlled deletion job), but it scares me...
Do you know other ways of doing the same thing while keeping the same level of features (we could align the two durations to simplify the problem)? I'm not looking for a solution for my existing apps, but would like to improve the next ones.
Any input will be highly appreciated !
It seems that every asset (blog, comment, ...) relies on the user. I would give the user table an "active" column which is 0 or 1. Then, on every query for the different assets, you check whether the owning user is active. Try to optimize this lookup with indexes or something like that. In my opinion it's the cleanest way. After this you can implement a job which runs a cascading delete on users who have been disabled for longer than x days.
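A small sqlite3 sketch of that idea (an active flag on the user, checked by every content query, plus a scheduled purge job); the table names and the retention window are illustrative:

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users    (id INTEGER PRIMARY KEY, name TEXT,
                           active INTEGER NOT NULL DEFAULT 1,
                           disabled_at TEXT);                   -- when the account was closed
    CREATE TABLE articles (id INTEGER PRIMARY KEY,
                           user_id INTEGER REFERENCES users(id),
                           title TEXT);
    CREATE INDEX idx_users_active ON users(active);
""")

# Every content query joins on the owner and checks the flag:
VISIBLE_ARTICLES = """
    SELECT a.* FROM articles a JOIN users u ON u.id = a.user_id WHERE u.active = 1
"""

def close_account(user_id):
    db.execute("UPDATE users SET active = 0, disabled_at = datetime('now') WHERE id = ?",
               (user_id,))

def purge_job(retention_days=30):
    # scheduled job: cascading delete of users disabled for longer than the retention window
    cutoff = '-%d days' % retention_days
    db.execute("""DELETE FROM articles WHERE user_id IN
                  (SELECT id FROM users
                   WHERE active = 0 AND disabled_at < datetime('now', ?))""", (cutoff,))
    db.execute("DELETE FROM users WHERE active = 0 AND disabled_at < datetime('now', ?)",
               (cutoff,))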