I am in the early stages of designing a VERY large system (it's an enterprise-level point of sale system). As some of you know, the data models on these things can get very complicated. I want to run this thing on Google App Engine because I want to put more of my resources into developing the software rather than building and maintaining an infrastructure.
In that spirit, I've been doing a lot of reading on GAE and the Datastore. I'm an old-school relational database modeler, and I've seen several different concepts of what a schemaless database is. I think I've figured out what the Datastore is, but I want to make sure I have it right.
So, if I'm right, GAE is a sort of table-based system. If I create a Java entity
class User {
    public String firstname;
    public String lastname;
}
and deploy it, the "table" user is automatically created and running. then in subsquent releases if i modify class user
class User {
    public String firstname;
    public String lastname;
    public Date addDate;
}
and deploy it, the "table" user is automatically updated with the new field.
Now, in relating data, as I understand it, it's very similar to some of the massively complex systems like SAP, where the data is in fact very organized, but due to the volume, referential integrity is a function of the application, not the database engine. So I would have code that looks like this:
class User {
    public long id;
    public String firstname;
    public String lastname;
}

class Phone {
    public String phonenumber;
    public User userentity;
}
and to pull up the phone numbers for a user, instead of
select phone from phone inner join user as phone.userentity = user where user.id = 5
(lay off, I know the syntax is incorrect, but you get the point)
I would do something like
select user from user where user.id = 5
then
select phone from phone where phone.userentity = user
and that would retrieve all the phone numbers for the user.
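In Java terms, I imagine that two-step lookup would look roughly like this with the low-level Datastore API (just a sketch, and it assumes the Phone entity stores the user's Key in its userentity property):

import java.util.List;
import com.google.appengine.api.datastore.*;

public class PhoneLookup {
    // Manual "join": fetch the user by key, then query the phones that reference it.
    static List<Entity> phonesForUser(long userId) throws EntityNotFoundException {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

        Key userKey = KeyFactory.createKey("User", userId);
        Entity user = ds.get(userKey);   // the "select user from user where user.id = 5" step

        // the "select phone from phone where phone.userentity = user" step
        Query q = new Query("Phone")
                .setFilter(new Query.FilterPredicate("userentity",
                        Query.FilterOperator.EQUAL, userKey));
        return ds.prepare(q).asList(FetchOptions.Builder.withDefaults());
    }
}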
So, as I understand it, it's not so much a huge change in how to think about structuring and organizing data as it is a big change in how to access it: I do joins manually in code instead of letting the database engine do them automatically. Beyond that it's the same. Am I correct, or am I clueless?
There are really no tables at all. If you make some users with only a first and last name, and then later add addDate, then your original entities will still not have an addDate property. None of the user entities are connected at all, in any way. They are not in a table of Users.
You can access all of the objects you wrote to the database that have the name "User" because appengine keeps big, long lists (indexes) of all of the objects that have each name. So, any object you put in there that has the name (kind) "User" will get an entry in this list. Later, you can read that index to get the location of each of your objects, and use those locations (keys) to fetch the objects. They are not in a table, they're just floating around. Some of them have some properties in common, but this is a coincidence, and not a requirement.
If you want to fetch all of the User objects that have a certain name (Select * from User where firstname="Joe") then you have to maintain another big long index of keys. This index has the firstname property as well as the key of an entity on each row. Later you can scan the index for a certain firstname, get all the keys, and then go look up the actual entities you stored with those keys. All of THOSE entities will have the firstname property (because you wouldn't enter an entity without the firstname property on your firstname index), but they may not have any other fields in common, because they are not in a table that enforces any data structure at all.
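To make that concrete, here is a rough sketch with the low-level Datastore API (kind and property names are only illustrative): two entities of the same kind can carry completely different properties, and a query only ever returns entities that appear in the relevant property index.

import com.google.appengine.api.datastore.*;

public class SchemalessDemo {
    static void demo() {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

        Entity a = new Entity("User");
        a.setProperty("firstname", "Joe");   // indexed, so 'a' appears in the firstname index
        ds.put(a);

        Entity b = new Entity("User");
        b.setProperty("nickname", "Bob");    // no firstname at all -- nothing enforces one
        ds.put(b);

        // Scans the firstname index; only entity 'a' can ever be returned here
        Query q = new Query("User")
                .setFilter(new Query.FilterPredicate("firstname",
                        Query.FilterOperator.EQUAL, "Joe"));
        Entity found = ds.prepare(q).asSingleEntity();
        System.out.println(found.getProperty("lastname"));  // null -- nothing guarantees it exists
    }
}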
These complications affect the way data is accessed pretty dramatically, and really affect things like transactions and complex queries. You're basically right that you don't have to change your thinking too much, but you should definitely understand how indexes and transactions work before planning your data structures. It is not always simple to efficiently tack on extra queries that you didn't think of before you got started, and it's pretty expensive to maintain these indexes, so the fewer you can get by with the better.
A great introduction to the Google Datastore was written by the creator of the Objectify framework: Fundamental Concepts of the Datastore
Related
Google App Engine offers the Google Datastore as its only NoSQL database (I think it is based on BigTable).
In my application I have a social-like data structure, and I want to model it as I would in a graph database. My application must save heterogeneous objects (users, files, ...) and relationships among them (such as user1 OWNS file2, user2 FOLLOWS user3, and so on).
I'm looking for a good way to model this typical situation, and I have thought of two families of solutions:
List-based solutions: Each object contains a list of other related objects, and an object's presence in the list is itself the relationship (as Google describes in the JDO documentation: https://developers.google.com/appengine/docs/java/datastore/jdo/relationships).
Graph-based solution: Both nodes and relationships are objects. The objects exist independently of the relationships, while each relationship contains a reference to the two (or more) connected objects.
What are the strong and weak points of these two approaches?
About approach 1: This is the simplest approach one can think of, and it is also presented in the official documentation, but:
Each directed relationship makes the object record grow: are there any limitations on the number of possible relationships, given for instance by the entity size limit?
Is that a JDO feature, or does the Datastore structure itself allow that approach to be implemented naturally?
The relationship search time will increase with the size of the list; is this solution suitable for large numbers (millions) of relationships?
About approach 2: Each relationship can be characterized more richly (it is an object and can have properties), and I think storage size is not a problem for Google, but:
Each relationship requires its own record, so will the search time for each related pair increase as the total number of relationships increases? Is this suitable for large numbers of relationships (millions, billions)? I.e., does Google have good tricks for searching among records if they are well structured, or will I soon be in a situation where, if I want to find a friend of User1 called User4, I have to wait seconds?
On the other hand, each object doesn't grow in size as new relationships are added.
Could you help me find other important points about the two approaches, so I can choose the best model?
First, the search time in the Datastore does not depend on the number of entities that you store, only on the number of entities that you retrieve. Therefore, if you need to find one relationship object out of a billion, it will take the same time as if you had just one object.
Second, the list approach has a serious limitation called "exploding indexes". You will have to index the property that contains a list to make it searchable. If you ever use a query that references more than just this property, you will run into this issue - google it to understand the implications.
Third, the list approach is much more expensive. Every time you add a new relationship, you will rewrite the entire entity at considerable writing cost. The reading costs will be higher too if you cannot use keys-only queries. With the object approach you can use keys-only queries to find relationships, and such queries are now free.
UPDATE:
If your relationships are directed, you may consider making Relationship entities children of User entities, and using an Object id as an id for a Relationship entity as well. Then your Relationship entity will have no properties at all, which is probably the most cost-efficient solution. You will be able to retrieve all objects owned by a user using keys-only ancestor queries.
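As a rough low-level-API sketch of that layout (kind names and ids here are purely illustrative):

import com.google.appengine.api.datastore.*;

public class OwnsRelationship {
    static final DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

    // Record that the user with ownerId OWNS the object with ownedObjectId.
    // The relationship entity is a child of the User and carries no properties:
    // its own id *is* the id of the owned object.
    static void addOwns(long ownerId, long ownedObjectId) {
        Key userKey = KeyFactory.createKey("User", ownerId);
        ds.put(new Entity("Owns", ownedObjectId, userKey));
    }

    // Keys-only ancestor query: the returned keys alone identify everything the user owns.
    static Iterable<Entity> ownedBy(long ownerId) {
        Key userKey = KeyFactory.createKey("User", ownerId);
        Query q = new Query("Owns").setAncestor(userKey).setKeysOnly();
        return ds.prepare(q).asIterable();
    }
}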
I have an AppEngine application and I use both approaches. Which is better depends on two things: the practical limits of how many relationships there can be and how often the relationships change.
NOTE 1: My answer is based on experience with Objectify and heavy use of caching. Mileage may vary with other approaches.
NOTE 2: I've used the term 'id' instead of the proper DataStore term 'name' here. Name would have been confusing and id matches objectify terms better.
Consider users linked to the schools they've attended and vice versa. In this case, you would do both. Link the users to schools with a variation of the 'List' method. Store the list of school ids the user attended as a UserSchoolLinks entity with a different type/kind but with the same id as the user. For example, if the user's id = '6h30n' store a UserSchoolLinks object with id '6h30n'. Load this single entity by key lookup any time you need to get the list of schools for a user.
However, do not do the reverse for the users that attended a school. For that relationship, insert a link entity. Use a combination of the school's id and the user's id for the id of the link entity. Store both id's in the entity as separate properties. For example, the SchoolUserLink for user '6h30n' attending school 'g3g0a3' gets id 'g3g0a3~6h30n' and contains the fields: school=g3g0a3 and user=6h30n. Use a query on the school property to get all the SchoolUserLinks for a school.
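A rough Objectify-style sketch of those two entities (annotations as in Objectify 4; field names are only illustrative):

import java.util.List;
import com.googlecode.objectify.annotation.Entity;
import com.googlecode.objectify.annotation.Id;
import com.googlecode.objectify.annotation.Index;

// Shares its id with the User it belongs to, e.g. "6h30n"; loaded by a single key lookup.
@Entity
class UserSchoolLinks {
    @Id String id;
    List<String> schoolIds;
}

// One entity per (school, user) pair; the id is "<schoolId>~<userId>", e.g. "g3g0a3~6h30n".
@Entity
class SchoolUserLink {
    @Id String id;
    @Index String school;   // "g3g0a3" -- queried to list all the users for a school
    String user;            // "6h30n"  -- kept as a property so the id never needs parsing
}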
Here's why:
Users will see their schools frequently but change them rarely. Using this approach, the user's schools will be cached and won't have to be fetched every time they hit their profile.
Since you will be getting the user's schools via a key lookup, you won't be using a query. Therefore, you won't have to deal with eventual consistency for the user's schools.
Schools may have many users that attended them. By storing this relationship as link entities, we avoid creating a huge single object.
The users that attended a school will change a lot. This way we don't have to write a single, large entity frequently.
By using the id of the User entity as the id for the UserSchoolLinks entity we can fetch the links knowing just the id of the user.
By combining the school id and the user id as the id for the SchoolUserLink, we can do a key lookup to see if a user and a school are linked. Once again, no need to worry about eventual consistency for that.
By including the user id as a property of the SchoolUserLink we don't need to parse the SchoolUserLink object to get the id of the user. We can also use this field to check consistency between both directions and have a fallback in case somehow people are attending hundreds of schools.
Downsides:
1. This approach violates the DRY principle. Seems like the least of evils here.
2. We still have to use a query to get the users who attended a school. That means dealing with eventual consistency.
Don't forget to update the UserSchoolLinks entity and add/remove the SchoolUserLink entity in a transaction.
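Since the two link entities will normally sit in different entity groups, that has to be a cross-group (XG) transaction. A minimal low-level-API sketch, assuming the two entities have already been built:

import com.google.appengine.api.datastore.*;

public class LinkUpdater {
    // Write both link entities atomically; they are in different entity groups,
    // so the transaction is opened with XG enabled.
    static void saveBoth(Entity userSchoolLinks, Entity schoolUserLink) {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        Transaction txn = ds.beginTransaction(TransactionOptions.Builder.withXG(true));
        try {
            ds.put(txn, userSchoolLinks);
            ds.put(txn, schoolUserLink);
            txn.commit();
        } finally {
            if (txn.isActive()) {
                txn.rollback();   // roll back if the commit never happened
            }
        }
    }
}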
Your question is quite complex, but I will try to explain what I think is the best solution (I will answer in Python, but the same can be done in Java).
from google.appengine.ext import db

class User(db.Model):
    followers = db.StringListProperty()
Adding a follower is simple:
user = User.get(key)
user.followers.append(str(followerKey))
user.put()  # save the change back to the datastore
This allows fast queries both for a user's followers (just read the list) and for whom a user follows:
User.all().filter('followers =', str(followerKey))  # -> users followed by followerKey
This query is costly in read I/O, so you can make it faster, but at the cost of more complexity and more write I/O:
class User(db.Model):
    followers = db.StringListProperty()
    follows = db.StringListProperty()
However, this is more complicated to keep consistent during changes: deleting a User, for example, means the follows lists must be updated too, so you need two writes for each change.
You can also store each relationship as its own entity, but that is the worst scenario here, since it is more complex than the second example with followers and follows. Keep in mind that an entity can hold at most 1 MB; that is usually not a limit you hit, but with large lists it can be.
I am currently developing a Google AppEngine (GAE) application and I am struggling a bit with the GAE DataStore best practices. I would like to use the DataStore in the most efficient way. I am using the Objectify framework, but am flexible to use something else if there is a better alternative.
My application uses three objects/tables:
- Items (id, description)
- List (id, listId, listDescription)
- SecurityProfile (id, listId, username, accessType)
In a relational world, my Items and SecurityProfile tables would have a foreign key linking them to a list (listId), and I would then use joins in my queries.
The typical Queries I need to make:
- Get all lists accessible to a particular user (need an index on "username" to filter by username and need to get the description from the List table)
- Get all items in list for a particular user (get the Items linked to the Lists retrieved in the query above)
I am struggling a bit to come up with a way to link the different objects in an efficient way (minimizing the DataStore queries and indexes).
I have seen in other posts that joins should be avoided and that I should de-normalize the model as much as possible.
So kind of creating one object only:
- Data (id, description, listId, listDescription, username, accessType)
I can see how that works from a read point of view, but if I update a listDescription or an accessType, or add a new username, I could potentially have to update a massive number of records. Is this really the way to go?
I'm only familiar with the Python NDB API, but things are similar in Java.
In Python NDB, I would recommend creating a Model for each of:
User,
List,
List item
Then, you can reference them with repeated KeyProperties, e.g.
from google.appengine.ext import ndb

class SecurityProfiles(ndb.Model):
    accessibleLists = ndb.KeyProperty(repeated=True)
class List(ndb.Model):
    listItems = ndb.KeyProperty(repeated=True)
Like this, you can pull a user's profile from the DataStore, and with the keys stored in accessibleLists you can get the lists accessible to the user.
Alternatively, you could do it the other way around:
class List(ndb.Model):
    usersWithAccess = ndb.KeyProperty(repeated=True)
and then you could immediately query for lists that are accessible to a given user.
I am just starting off with app development and am currently writing an Android application which has registered users and a list of 'challenges' which they are able to select and later mark as completed/failed.
The plan is to eventually store all users/challenge/etc data on a database though I haven't implemented this yet.
The issue I have run into is this: in my current design, each User has list variables containing their current challenges and completed challenges, e.g. two ArrayList fields.
Users currently select challenges from a listview of different Challenge objects, which are then added to the user's CurrentChallenges list.
What I had not accounted for is how to structure this so that when a user takes on a challenge, they have their own unique copy of that challenge that can be independently marked as completed etc., whereas at the minute every user that selects, say, Challenge 1 is simply adding the same challenge, with the same ID etc., as every other user that selects Challenge 1.
I suppose I could have each different challenge be its own subclass of Challenge and assign every user who selects that challenge type a different instance of that class; however, this seems like it would be a very messy/inefficient method, as all the different classes would be largely the same.
Does anyone have any good ideas or design patterns for this case? Preferably a solution that will be compatible with later storing these challenges in a database and presumably using ORM.
Thanks a lot for any suggestions,
E
I'd move every aspect of a challenge that is different for each user into a new Attempt class. So Challenge might have variables for name, description etc. and Attempt would have inProgress, completed etc. Obviously these are just examples, replace them with whatever data you're actually storing.
Now in your User class, you can record challenges using a Map. Make it a Map<Challenge, Attempt> and each User will be able to store an Attempt for each Challenge to record their progress. The Challenge instances are shared between users but there is an Attempt instance for each combination of User and Challenge.
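A minimal sketch of that layout (the fields are just placeholders for whatever you actually store):

import java.util.HashMap;
import java.util.Map;

// Shared description of a challenge; one instance per challenge, shared by all users.
class Challenge {
    final long id;
    final String name;
    final String description;

    Challenge(long id, String name, String description) {
        this.id = id;
        this.name = name;
        this.description = description;
    }
}

// Per-user progress for one challenge.
class Attempt {
    boolean inProgress = true;
    boolean completed = false;
}

class User {
    // One Attempt per Challenge this user has taken on. Challenge instances are shared,
    // so the default identity-based equals/hashCode is fine for the map key.
    final Map<Challenge, Attempt> challenges = new HashMap<>();

    void takeOn(Challenge c) {
        challenges.putIfAbsent(c, new Attempt());
    }

    void markCompleted(Challenge c) {
        Attempt a = challenges.get(c);
        if (a != null) {
            a.inProgress = false;
            a.completed = true;
        }
    }
}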
When you implement the database later, Challenge, User and Attempt would each translate to a table. Attempt would have foreign keys for both of the other tables. Unfortunately I haven't used ORMs much so I'm not sure whether they'd work with a Map correctly.
I have a customer with a very small set of data and records that I'd normally just serialize to a data file and be done but they want to run extra reports and have expandability down the road to do things their own way. The MySQL database came up and so I'm adapting their Java POS (point of sale) system to work with it.
I've done this before and here was my approach in a nutshell for one of the tables, say Customers:
I set up a loop to store the primary keys in an ArrayList, then set up a form to go from one record to the next, running SQL queries based on the PK. The query would pull down the fname, lname, address, etc. and fill in the fields on the screen.
I thought it might be a little clunky running a SQL query each time they click Next. So I'm looking for another approach to this problem. Any help is appreciated! I don't need exact code or anything, just some concepts will do fine
Thanks!
I would say the solution you suggest is not very good, not only because you run a SQL query every time a button is pressed, but also because you are iterating over primary keys, which probably are not sorted in any meaningful order...
What you want is to retrieve a certain number of records which are sorted sensibly (by first/last name or something) and keep them as a kind of cache in your ArrayList or something similar... This can be done quite easily with SQL. When the user starts iterating over the results by pressing "Next", you can start loading more records in the background.
The key to keeping it usable is to load some records before the user actually requests them, to keep latency small, while keeping in mind that you also don't want to load the whole database at once...
Take a look at indexing your database. http://www.informit.com/articles/article.aspx?p=377652
Use JPA with the built in Hibernate provider. If you are not familiar with one or both, then download NetBeans - it includes a very easy to follow tutorial you can use to get up to speed. Managing lists of objects is trivial with the new JPA and you won't find yourself reinventing the wheel.
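For example, paging through customers with plain JPA might look roughly like this (the entity and its fields are made up for illustration):

import java.util.List;
import javax.persistence.*;

@Entity
public class Customer {
    @Id @GeneratedValue
    Long id;
    String fname;
    String lname;
    String address;
}

class CustomerPages {
    // One page of customers, ordered by name; pages are numbered from 0.
    static List<Customer> page(EntityManager em, int page, int pageSize) {
        return em.createQuery(
                    "SELECT c FROM Customer c ORDER BY c.lname, c.fname", Customer.class)
                 .setFirstResult(page * pageSize)
                 .setMaxResults(pageSize)
                 .getResultList();
    }
}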
The key concept here is pagination.
Let's say you set your page size to 10. This means you select 10 records from the database, in a certain order, so your query should have an ORDER BY clause and a LIMIT clause at the end. You use this result set to display the form while the user navigates with Previous/Next buttons.
When the user navigates off the page, you fetch another page.
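A rough JDBC sketch of that idea, assuming MySQL-style LIMIT/OFFSET and illustrative table/column names:

import java.sql.*;
import java.util.ArrayList;
import java.util.List;

public class CustomerPager {
    // Fetch one page of customer rows; page numbering starts at 0.
    static List<String[]> loadPage(Connection conn, int page, int pageSize) throws SQLException {
        String sql = "SELECT fname, lname, address FROM Customers "
                   + "ORDER BY lname, fname LIMIT ? OFFSET ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, pageSize);
            ps.setInt(2, page * pageSize);
            List<String[]> rows = new ArrayList<>();
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    rows.add(new String[] { rs.getString("fname"),
                                            rs.getString("lname"),
                                            rs.getString("address") });
                }
            }
            return rows;
        }
    }
}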
https://www.google.com/search?q=java+sql+pagination
I have a Database storing details of products which are taken from many sites, and gathered through the individual sites API's. When I call the feed, the details are stored in a database table.
The problem I'm having is that because the exact same product is listed on many sites by the seller I end up having duplicate items in my database, and then when I display them on a web page there are many duplicates.
The problem is that the item doesn't have any obvious unique identifier, it has specific details of the item (of which there could be many), and then a description of the item from the seller.
What I would like is for the item to show up once, and then give the user details of where else the item is listed.
How would I identify the duplicates that have come in without slowing down the entire database? And how would I then pick one advert from all the duplicates and store which other sites the advert is displayed on?
Thanks for any help.
The problem is two-fold, and both parts are on your side. When you figure out how to deal with them, writing the code (Java or SQL) will be easy. I'll name the problems first and then identify the solutions.
For some unknown reason, you have assumed that collecting product descriptions from multiple sites will not collect the same product.
You are used to the common and nonsensical Id column, which is fine when you are working with spreadsheets prototyping functionality; but it is nowhere near what is required for a database or Development-level functionality. Your users (or boss) have naturally expected database capability from the database, and you did not provide any. (And no, it does not require fuzzy string logic or magic of any kind.)
Solution
This is a condensed version of the IDEF1X Standard for modelling Relational Databases; the portion re Identifiers.
You need to think in database terms, and think about the database tables you need to perform your function, which means you are not allowed to use an auto-increment Id column. That column gives a spreadsheet a RowId, but it does not imply anything about the content of the table, or the columns that identify a product.
And you cannot simply rip data off another website, you need to think about what your website requires for products. What does your company understand a product to be, and how does it identify a product ?
Identify all the columns and datatypes for the columns.
Identify which columns are mandatory and which are optional.
Identify which are strong Identifiers. Eg. Manufacturer and Model; the short Product Name, not the long Description (or maybe, for your company, the long description is an Identifier). Work with your users, and work that out.
You will find you actually have a small cluster of tables around Product, such as Manufacturer, ProductType, perhaps Vendor, etc.
Organise those tables, and Normalise them, so that you are not duplicating data.
Make sure you treat those Identifiers with a bit of respect. Choose which will be unique. Those are Candidate Keys. You need at least one per table, and there will be more than one in Product. All the Identifiers that will be searched on will need to be indexed (Unique or not). Note that Unique Indices cannot be Nullable, so you cannot choose an optional column.
What makes a single Unique Identifier for Product may not be a single column. That's ok, we can evaluate multiple columns for keys in databases; they are called Compound Keys.
Take the best, most stable (one which will not change) Unique Identifier, one of the Candidate Keys, and make that the Primary Key.
If, and only if, the Unique Identifier, the Primary Key, which may be a Compound Key, is very long, and therefore unsuitable for a Primary Key, which is migrated to the child tables, then add a Surrogate Key. That will be the Id column. Note that that is an additional column and additional Index. It is not a substitute for the Identifiers of Product, the Candidate Keys; they cannot be removed.
So far we have a Product database on your companies side of the web, that is meaningful to it. Now we are in a position to evaluate products from the other side of the web; and when we do, we have a framework on our side that is strong, against which we can measure the rubbish that we get from the other side of the web.
Feeds
You need a WebSite table to manage the feeds.
There will be an Associative table (many-to-many) between Product and WebSite. Let's call it ProductSite. It will contain only our ProductId and the WebSiteCode. It may contain Price. The contents are valid for a single feed cycle.
Load each feed into a staging database or schema, an incoming ProductIn table, maybe one per source website. This is just the flat file from the external source. Add a column IsValid and set the Default to true.
Then write some SQL that compares that ProductIn table, with its loose and floppy contents, with our Product table with its strong Identifiers.
The way I would do it is several waves of separate checks, each marking the rows that fail by setting IsValid to false. At the end, Insert the rows that are still IsValid into our ProductSite.
You might be lucky, and get away with an optimistic approach. That is, as long as you find a match on a few important columns, the match is valid. (reverse the Default and update of the IsValid boolean).
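As a very rough sketch of one such wave plus the final insert (all table and column names here are illustrative; your real Identifiers will differ):

import java.sql.*;

public class FeedMatcher {
    // One validation "wave" plus the final insert into ProductSite.
    static void matchFeed(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            // Wave 1: fail rows whose Manufacturer/Model pair is unknown to our Product table
            st.executeUpdate(
                  "UPDATE ProductIn SET IsValid = 0 "
                + "WHERE NOT EXISTS (SELECT 1 FROM Product p "
                + "                  WHERE p.Manufacturer = ProductIn.Manufacturer "
                + "                    AND p.Model = ProductIn.Model)");

            // Rows that survived every wave are linked to our Product via the Identifier columns
            st.executeUpdate(
                  "INSERT INTO ProductSite (ProductId, WebSiteCode, Price) "
                + "SELECT p.ProductId, f.WebSiteCode, f.Price "
                + "FROM ProductIn f "
                + "JOIN Product p ON p.Manufacturer = f.Manufacturer AND p.Model = f.Model "
                + "WHERE f.IsValid = 1");
        }
    }
}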
This is the proc that will require some back-and-forth work until it settles down. That is why you need to work with your users re the Identifiers. The goal is to exclude no external products, but your starting point will exclude many. That will include going back to our Product table and improving the content (values in the rows) of the Identifiers, and other relevant columns that you use to identify matching rows.
Repeat for each WebSite.
Now populate our website from our Product table, using information that we are confident about, and show which sites have the product for sale from ProductSite.
I don't think this is a code or database problem (yet). You say:
The problem is that the item doesn't have any obvious unique identifier
You need to work out what that uniqueness is before you can ask a computer to do it for you. It sounds like you need some sort of fuzzy string-similarity algorithm.
Some examples of data that you consider duplicates might help.