I am using Google Cloud Endpoints (Java) as a backend for an Android app. The data on the backend is stored to Datastore.
The problem is that the Android client often doesn't receive correct data from the backend. After repeating the request, the correct data is eventually received.
Can you tell me what could cause this problem? Is there some kind of response caching in the Cloud Endpoints library? Or caching within Datastore?
I have read about the eventual consistency provided by Datastore, so I am trying to wait at least 10 seconds after updating entities before making the next query. Is that enough?
Thanks a lot.
It sounds as though your data model isn't structured for the way you want to use it. Without seeing an example of your data model I can't give you hints as to how to change it, but here is a simple example that might help you.
Suppose I have a "Post" entity and a "CommentForPost" entity. Users can create many CommentForPost entries for each Post entry. Logically, CommentForPost is a child of Post, but if you don't make the Post an ancestor of all the CommentForPost entries then immediate consistency is not guaranteed: if user A creates a comment and then queries for it, they may not see the entry right away.
On the other hand, if you make the Post entity the parent of all of its CommentForPost entries then you do guarantee immediate consistency, because they are saved in a single entity group. If user A creates an entry and immediately queries for it (using the Post key as an ancestor) then they are guaranteed to get accurate data back.
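To make this concrete, here is a rough sketch using the low-level Java Datastore API (the kinds match the example above; the property names are made up for illustration):

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.PreparedQuery;
import com.google.appengine.api.datastore.Query;

DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

// Create the parent Post.
Entity post = new Entity("Post");
post.setProperty("title", "My first post");
Key postKey = datastore.put(post);

// Create a CommentForPost as a child of the Post; parent and child now share one entity group.
Entity comment = new Entity("CommentForPost", postKey);
comment.setProperty("text", "Nice post!");
datastore.put(comment);

// An ancestor query is strongly consistent within the entity group,
// so the comment written above is guaranteed to be visible here.
Query query = new Query("CommentForPost").setAncestor(postKey);
PreparedQuery comments = datastore.prepare(query);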
The limitation here is that you would only be able to create about one CommentForPost entry per second for a given Post, since that is the write limit for a single entity group. This is where the give and take lies. If you need a lot of updates within short periods of time then you can't guarantee consistency (e.g. 100 users all adding CommentForPost entries at the same time). If you want to guarantee consistency then you can't have a lot of updates within a short period of time.
Make sense?
Straight from the docs at Structuring Data for Strong Consistency: if you always require the result of a Datastore query to be consistent, you need to use an ancestor query, as in the example code from the article:
Query query = new Query("Greeting").setAncestor(guestbookKey);
You can't rely on a delay to obtain consistency as it depends on replication across data centers and the amount of time this takes is never, well, consistent :)
Without relying on the database, is there a way to ensure a field (let's say a User's emailAddress) is unique?
Some common attempts that fail:
Check first if the emailAddress exists (by querying the DB) and, if not, create the user. Obviously, in the window of check-then-act some other thread can create a user with the same email. Hence this solution is no good.
Apply a language-level lock on the method responsible for creating the user. This solution fails because we need redundancy of the service for performance reasons, and the lock applies only within a single JVM.
Use an event store (like an Akka actor's mailbox), the event being an AddUser message. But since the actor's behavior is asynchronous, the requestor (sender) can't be notified that user creation with a unique email succeeded. Moreover, how would two requests (with the same email) know whether the email is unique? This can get complicated.
The database, being the single source of data that every thread and every service instance writes to, makes sense as the place to implement the unique constraint. But this holds true for relational databases.
Then what about NoSQL databases? Some do allow a unique constraint, but it's not their native behavior (for some it may be).
But if we don't use the database to implement uniqueness of a field, what would the options be?
I think your question is more generic - "how do I ensure a database write action succeeded, and how do I handle cases where it didn't?". Uniqueness is just one failure mode - you may be attempting to insert a value that's too big, or of the wrong data type, or that doesn't match a foreign key constraint.
Relational databases solve this through being ACID-compliant, and throwing errors for the client to deal with when a transaction fails.
You want (some of) the benefits of ACID without the relational database. That's a fairly big topic of conversation. The obvious way to solve this is to introduce the concept of a "transaction" in your application layer. For instance, in your case, you might send a "create account(emailAddress, name, ...)" message, and have the application listen for either an "accountCreated" or "accountCreationFailed" response.

The recipient of that message is responsible for writing to the database, and you have a couple of options. One is to lock that thread, so only one process can write to the database at any time; that's not very scalable. The other mechanism I've used is to introduce status flags: you write the account data to the database with a "draft" flag, then check for your constraints (including uniqueness), and set the "draft" flag to "validated" if the constraints are met (i.e. there is no other record with the same email address), or to "failed" if they are not.
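As a rough illustration of that status-flag mechanism, here is a sketch in plain JDBC (the accounts table, its columns, and the oldest-draft-wins tie-break are all assumptions):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch of the draft-then-validate flow. Assumes the connection is in
// auto-commit mode, so each step is immediately visible to other writers.
boolean createAccount(Connection conn, String email, String name) throws SQLException {
    long id;
    // Step 1: write the record with status 'draft'.
    try (PreparedStatement insert = conn.prepareStatement(
            "INSERT INTO accounts (email, name, status) VALUES (?, ?, 'draft')",
            Statement.RETURN_GENERATED_KEYS)) {
        insert.setString(1, email);
        insert.setString(2, name);
        insert.executeUpdate();
        try (ResultSet keys = insert.getGeneratedKeys()) {
            keys.next();
            id = keys.getLong(1);
        }
    }
    // Step 2: validate. Another row with the same email (already validated,
    // or an older draft) means we lose; flip our row to 'failed', else 'validated'.
    try (PreparedStatement check = conn.prepareStatement(
            "SELECT COUNT(*) FROM accounts WHERE email = ? AND id <> ? " +
            "AND (status = 'validated' OR (status = 'draft' AND id < ?))")) {
        check.setString(1, email);
        check.setLong(2, id);
        check.setLong(3, id);
        try (ResultSet rs = check.executeQuery()) {
            rs.next();
            boolean unique = rs.getInt(1) == 0;
            try (PreparedStatement flip = conn.prepareStatement(
                    "UPDATE accounts SET status = ? WHERE id = ?")) {
                flip.setString(1, unique ? "validated" : "failed");
                flip.setLong(2, id);
                flip.executeUpdate();
            }
            return unique;
        }
    }
}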
To check for uniqueness you need to store the "state" of the program. For safety, you need to be able to apply changes to that state transactionally.
You can use database transactions. A few of the NoSQL databases support transactions too, for example Redis and MongoDB; you have to check each vendor separately to see how it supports them. In this setup, each client connects to the database, and the database handles all of the details for you. Depending on your use case, you should also be careful about the isolation level configuration.
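For instance, with plain JDBC a check-then-insert stops racing once it runs inside a serializable transaction, because the database aborts one of two conflicting transactions instead of letting both insert the same email (a sketch; the users table is an assumption):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Check-then-act made safe by a SERIALIZABLE transaction.
boolean createUser(Connection conn, String email, String name) throws SQLException {
    conn.setAutoCommit(false);
    conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
    try {
        try (PreparedStatement check = conn.prepareStatement(
                "SELECT 1 FROM users WHERE email = ?")) {
            check.setString(1, email);
            try (ResultSet rs = check.executeQuery()) {
                if (rs.next()) {
                    conn.rollback();
                    return false; // email already taken
                }
            }
        }
        try (PreparedStatement insert = conn.prepareStatement(
                "INSERT INTO users (email, name) VALUES (?, ?)")) {
            insert.setString(1, email);
            insert.setString(2, name);
            insert.executeUpdate();
        }
        conn.commit();
        return true;
    } catch (SQLException e) {
        conn.rollback();
        throw e; // serialization failures surface here; callers may retry
    }
}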
If durability is not a concern, you can use in-memory databases that support transactions.
Whichever state store you choose, it should support transactions. There are several ways to implement transactions and achieve consistency; many relational databases, like PostgreSQL, do it by implementing the MVCC algorithm. In a distributed environment you have to look at distributed transaction protocols such as 2PC, Paxos, etc.
Normally everybody relies on available datastore solutions, unless the project has a weird or very specific requirement.
A final note: the communication pattern is not related to the underlying problem here. For example, in the Actor case you mentioned, at the end of the day each actor has to query the state to find out whether an email exists or not. If your state store supports serializability then there is no problem and conflicts will not happen (communicating the error to the client is another issue). Suppose you are using PostgreSQL: when an insert/update query is issued, it is wrapped in a transaction, and the underlying MVCC algorithm takes care of everything. In an advanced, distributed environment you can use data stores that support distributed transactions, like CockroachDB.
If you want to dive deep you can research these keywords: ACID, isolation levels, atomicity, serializability, CAP theorem, 2PC, MVCC, distributed transactions, distributed locks, ...
NoSQL databases provide different, weaker, guarantees than relational databases. Generally, the tradeoff is you give up ACID guarantees in exchange for increased scalability in the dimensions that matter for your application.
It's possible to provide some kind of uniqueness guarantee, but subject to certain tradeoffs. With NoSQL, there are always tradeoffs.
If your NoSQL store supports optimistic concurrency control, maybe this approach will work:
Store a separate document that contains the set of all emailAddress values across all documents in your NoSQL table. There is a single instance of this document at any given time.
Each time you want to save a document containing emailAddress, first confirm email address uniqueness:
Perform the following actions, protected by optimistic locking; you can retry on the backend if this fails due to a concurrent update (a sketch follows the list):
Read this "all emails" document.
Confirm the email isn't present.
If not present, add the email address to the "all emails" document.
Save it.
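A minimal sketch of that loop, assuming a document store that offers a versioned read and a compare-and-swap write (the DocumentStore interface and its method names are invented for illustration):

import java.util.HashSet;
import java.util.Set;

// Hypothetical client API: a versioned read plus a conditional write that
// succeeds only if the document hasn't changed since we read it.
interface DocumentStore {
    VersionedDoc get(String key);
    boolean putIfVersion(String key, Set<String> value, long expectedVersion);
}

record VersionedDoc(Set<String> emails, long version) {}

boolean reserveEmail(DocumentStore store, String email) {
    while (true) {
        VersionedDoc doc = store.get("all-emails");   // read the "all emails" document
        if (doc.emails().contains(email)) {
            return false;                             // email already present
        }
        Set<String> updated = new HashSet<>(doc.emails());
        updated.add(email);                           // add the new address
        if (store.putIfVersion("all-emails", updated, doc.version())) {
            return true;                              // save succeeded
        }
        // A concurrent update bumped the version; loop and retry.
    }
}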
You've now traded one problem (the lack of unique constraints) for another (the inability to synchronise updates across your original document and this new "all emails" document). This may or may not be acceptable; it depends on the guarantees that your application needs to provide.
e.g. maybe you can accept that an email is added to "all emails", that saving the related document to your other "table" subsequently fails, and that that email address is then unusable. You could clean this up with a batch job somehow. Not sure.
The index of emails could also be stored in some other service (e.g. a persistent cache). The same problem exists: you need to keep the index and your document store in sync somehow.
There's no easy solution. For a detailed overview of the relevant concepts, I'd recommend Designing Data-Intensive Applications by Martin Kleppmann.
I have seen videos and read the documentation of Cloud Firestore, the Google Firebase service, but I can't figure this out coming from Realtime Database.
I have this web app in mind in which I want to store my providers of different categories of products. I want to perform a search query through all my products to find which providers I have for a given product, and eventually access that provider's info.
I am planning to use this structure for this purpose:
Providers (Collection)
  Provider 1 (Document)
    Name
    City
    Categories
  Provider 2 (Document)
    Name
    City
Products (Collection)
  Product 1 (Document)
    Name
    Description
    Category
    Provider ID
  Product 2 (Document)
    Name
    Description
    Category
    Provider ID
So my question is, is this approach the right way to access the provider info once I get the product I want?
I know this is possible in the Realtime Database: using the provider ID I could search for that provider in the providers section. But with Firestore I am not sure if it's possible, or if this is the right approach.
What is the correct way to structure this kind of data in Firestore?
You need to know that there is no "perfect", "the best" or "the correct" solution for structuring a Cloud Firestore database. The best and correct solution is the solution that fits your needs and makes your job easier. Bear also in mind that there is also no single "correct data structure" in the world of NoSQL databases. All data is modeled to allow the use-cases that your app requires. This means that what works for one app, may be insufficient for another app. So there is not a correct solution for everyone. An effective structure for a NoSQL type database is entirely dependent on how you intend to query it.
The way you are structuring your data looks good to me. In general, there are two ways in which you can achieve the same thing. The first one would be to keep a reference of the provider in the product object (as you already do) or to copy the entire provider object within the product document. This last technique is called denormalization and is a quite common practice when it comes to Firebase. So we often duplicate data in NoSQL databases, to suit queries that may not be possible otherwise. For a better understanding, I recommend you see this video, Denormalization is normal with the Firebase Database. It's for Firebase Realtime Database but the same principles apply to Cloud Firestore.
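For the reference-based approach, the provider lookup you describe works with one extra read (a sketch using the server-side Java client; the document IDs and field names are assumptions):

import com.google.cloud.firestore.DocumentSnapshot;
import com.google.cloud.firestore.Firestore;
import com.google.cloud.firestore.FirestoreOptions;

Firestore db = FirestoreOptions.getDefaultInstance().getService();

// Read the product, then follow its provider reference with a second read.
// get() returns an ApiFuture; the second get() blocks for the result, so handle
// InterruptedException/ExecutionException in real code.
DocumentSnapshot product = db.collection("Products").document("product1").get().get();
String providerId = product.getString("providerId");
DocumentSnapshot provider = db.collection("Providers").document(providerId).get().get();
System.out.println(provider.getString("Name") + ", " + provider.getString("City"));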
Also, when you are duplicating data, there is one thing that you need to keep in mind: in the same way you are adding data, you need to maintain it. In other words, if you want to update/delete a provider object, you need to do it in every place that it exists.
You might wonder now, which technique is best. In a very general sense, the best way in which you can store references or duplicate data in a NoSQL database is completely dependent on your project's requirements.
So you should ask yourself some questions about the data you want to duplicate or simply keep it as references:
Is the data static or will it change over time?
If it does, do you need to update every duplicated instance of the data so they all stay in sync? This is what I have also mentioned earlier.
When it comes to Firestore, are you optimizing for performance or cost?
If your duplicated data needs to change and stay in sync at the same time, then you might have a hard time in the future keeping all those duplicates up to date. This might also mean you spend a lot of money keeping all those documents fresh, as it requires a read and a write for each document for each change. In this case, holding only references will be the winning variant.
In this kind of approach, you write very little duplicated data (pretty much just the Provider ID). So that means that your code for writing this data is going to be quite simple and quite fast. But when reading the data, you will need to load the data from both collections, which means an extra database call. This typically isn't a big performance issue for reasonable numbers of documents, but definitely does require more code and more API calls.
If you need your queries to be very fast, you may prefer to duplicate more data so that the client only has to read one document per item queried, rather than multiple documents. But you may also be able to depend on local client caches to make this cheaper, depending on the data the client has to read.
In this approach, you duplicate all data for a provider in each product document. This means that the code to write this data is more complex, and you're definitely storing more data: one more provider object for each product document. And you'll need to figure out if and how to keep it up to date in each document. But on the other hand, reading a product document now gives you all the information about the provider in one read.
This is a common consideration in NoSQL databases: you'll often have to consider write performance and disk storage vs. reading performance and scalability.
For your choice of whether or not to duplicate some data, it is highly dependent on your data and its characteristics. You will have to think that through on a case-by-case basis.
So in the end, remember that both are valid approaches and neither is inherently better than the other. It all depends on what your use-cases are and how comfortable you are with this technique of duplicating data. Data duplication is the key to faster reads, not just in Cloud Firestore or Firebase Realtime Database but in general. Any time you add the same data to a different location, you're duplicating data in favor of faster read performance. Unfortunately, in return you get more complex updates and higher storage/memory usage. Note, though, that extra calls are cheap in Firebase Realtime Database, while in Firestore they are not. How much data duplication versus how many extra database calls is optimal for you depends on your needs and your willingness to let go of the "single point of definition" mindset, which is admittedly subjective.
After finishing a few Firebase projects, I find that my reading code gets drastically simpler if I duplicate data. But of course, the writing code gets more complex at the same time. It's a trade-off between these two and your needs that determines the optimal solution for your app. Furthermore, to be even more precise you can also measure what is happening in your app using the existing tools and decide accordingly. I know that is not a concrete recommendation but that's software development. Everything is about measuring things.
Remember also, that some database structures are easier to be protected with some security rules. So try to find a schema that can be easily secured using Cloud Firestore Security Rules.
Please also take a look at my answer from this post where I have explained more about collections, maps and arrays in Firestore.
Let's presume that we have an application "mail client" and a front-end for it.
If a user is typing a message or editing the subject or whatever, a REST call is made to update whatever the user was changing (e.g. the receivers) to keep the message in DRAFT. So a lot of PUTs happen to save the message. When the window is closed, an update of every editable field happens at the same time. Hibernate can't handle this concurrency: each of those calls retrieves the message, edits its own fields, and tries to save the message again, while another call has already changed it.
I know I can add a REST call to save all fields at the same time, but I was wondering if there is a cleaner solution, or a decent strategy to handle such cases (for example only updating one field, or some merge strategy if the object has already changed).
Thanks in advance!
The easiest solutions here would be to tweak the UI to either:
Submit a single REST call during email submission that does all the necessary tasks, or
Serialize the REST calls so they're chained rather than fired concurrently.
The concern I have here is that this will snowball at some point and become a bigger concurrency problem as more users interact with the application. Consider for a moment the number of concurrent REST calls your web infrastructure will have to support when you're faced with 100, 500, 1000, or even 10000 or more concurrent users.
Does it really make sense to beef up the volume of servers to handle that load when the load itself is a product of a design flaw in the first place?
Hibernate is designed to handle locking through two mechanisms, optimistic and pessimistic.
Optimistic Way
1. Read the entity from the data store.
2. Cache a copy of the fields you're going to modify in temporary variables.
3. Modify the field or fields based on your PUT operation.
4. Attempt to merge the changes.
5. If the save succeeds, you're done.
6. Should an OptimisticLockException occur, refresh the entity state from the data store.
7. Compare the cached values to the fields you must change.
8. If the values differ, you can assert or throw an exception.
9. If they don't differ, go back to 4.
The beautiful part of the optimistic approach is that you avoid any form of deadlock, particularly if you're allowing multiple tables to be read and locked separately.
While you can use pessimistic lock options, optimistic locking is the generally accepted way to handle concurrent operations, as it has the least concurrency contention and performance impact.
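To illustrate the steps above, here is a minimal sketch with JPA/Hibernate (the Message entity and its fields are assumptions; the @Version column is what makes Hibernate raise OptimisticLockException on conflicting updates):

import jakarta.persistence.Entity;
import jakarta.persistence.EntityManager;
import jakarta.persistence.Id;
import jakarta.persistence.OptimisticLockException;
import jakarta.persistence.Version;

// The @Version column lets Hibernate detect concurrent updates.
@Entity
class Message {
    @Id Long id;
    @Version long version;   // bumped automatically on every update
    String subject;
    // ... other draft fields omitted
}

void updateSubject(EntityManager em, Long messageId, String newSubject) {
    while (true) {
        em.getTransaction().begin();
        try {
            Message msg = em.find(Message.class, messageId); // step 1: read
            String cached = msg.subject;                     // step 2: cache the field we change
            msg.subject = newSubject;                        // step 3: modify
            em.flush();                                      // step 4: attempt the merge
            em.getTransaction().commit();                    // step 5: success, done
            return;
        } catch (OptimisticLockException e) {
            em.getTransaction().rollback();                  // step 6: a concurrent update won
            em.clear();                                      // so the next find() re-reads fresh state
            // steps 7-9: compare the cached value against the fresh one and
            // either raise an error or simply retry, as we do here
        }
    }
}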
Background::::
I'm working with Google App Engine (GAE) for Java. I'm struggling to design a data model that plays to Bigtable's strengths and weaknesses; these are two previous related posts:
Database design - google app engine
Appointments and Line Items
I've tentatively decided on a fully normalized backbone with denormalized properties added into entities so that most client requests can be serviced with only one query.
I reason that a fully normalized backbone will:
Help maintain data integrity if I make a mistake in the denormalization code
Enable writes in one operation from a client's perspective
Allow for any type of unanticipated query on the data (provided one is willing to wait)
While the denormalized data will:
Enable most client requests to be serviced very fast
Basic denormalization technique:::
I watched an App Engine video describing a technique referred to as "fan-out." The idea is to make quick writes to the normalized data and then use the task queue to finish the denormalization behind the scenes without the client having to wait. I've included the video here for reference, but it's an hour long and there's no need to watch it in order to understand this question:
http://code.google.com/events/io/2010/sessions/high-throughput-data-pipelines-appengine.html
If I use this "fan-out" technique, every time the client modifies some data, the application would update the normalized model in one quick write and then fire off the denormalization instructions to the task queue so the client does not have to wait for them to complete as well.
Problem:::
The problem with using the task queue to update the denormalized version of the data is that the client could make a read request on data they just modified before the task queue has completed the denormalization of that data. This would give the client stale data that is incongruent with their recent request, confusing the client and making the application appear buggy.
As a remedy, I propose fanning out denormalization operations in parallel via asynchronous calls to other URLs in the application via URLFetch: http://code.google.com/appengine/docs/java/urlfetch/ The application would wait until all of the asynchronous calls had completed before responding to the client request.
For example, suppose I have an "Appointment" entity and a "Customer" entity. Each appointment would include a denormalized copy of the customer information for the customer it's scheduled for. If a customer changed their first name, the application would make 30 asynchronous calls: one to each affected appointment resource, in order to change the copy of the customer's first name in each one.
In theory, this could all be done in parallel. All of this information could be updated in roughly the time it takes to make 1 or 2 writes to the datastore. A timely response could be made to the client after the denormalization was completed eliminating the possibility of the client being exposed to incongruent data.
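For reference, the parallel fan-out with URLFetch would look roughly like this (the /denormalize handler, its parameter, and the affectedAppointmentIds list are assumptions; exception handling is omitted):

import com.google.appengine.api.urlfetch.HTTPRequest;
import com.google.appengine.api.urlfetch.HTTPResponse;
import com.google.appengine.api.urlfetch.URLFetchService;
import com.google.appengine.api.urlfetch.URLFetchServiceFactory;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Future;

URLFetchService fetcher = URLFetchServiceFactory.getURLFetchService();
List<Future<HTTPResponse>> pending = new ArrayList<Future<HTTPResponse>>();

// Fire one asynchronous request per affected appointment (subject to the 10-call limit below).
for (String appointmentId : affectedAppointmentIds) {
    URL url = new URL("http://myapp.appspot.com/denormalize?appointment=" + appointmentId);
    pending.add(fetcher.fetchAsync(new HTTPRequest(url)));
}

// Block until every denormalization call has completed before responding to the client.
for (Future<HTTPResponse> response : pending) {
    response.get();
}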
The biggest potential problem I see with this is that the application cannot have more than 10 asynchronous request calls in flight at any one time, as documented here: http://code.google.com/appengine/docs/java/urlfetch/overview.html
Proposed denormalization technique (recursive asynchronous fan-out):::
My proposed remedy is to send denormalization instructions to another resource that recursively splits the instructions into equal-sized smaller chunks, calling itself with the smaller chunks as parameters until the number of instructions in each chunk is small enough to be executed outright. For example, suppose a customer with 30 associated appointments changed the spelling of their first name. I'd call the denormalization resource with instructions to update all 30 appointments. It would then split those instructions into 10 sets of 3 instructions and make 10 asynchronous requests to its own URL, one with each set of 3 instructions. Once an instruction set contained fewer than 10 instructions, the resource would make the asynchronous requests outright, one per instruction.
My concerns with this approach are:
It could be interpreted as an attempt to circumvent App Engine's rules, which would cause problems. (It's not even allowed for a URL to call itself, so I'd in fact have to have two URL resources that call each other to handle the recursion.)
It is complex with multiple points of potential failure.
I'd really appreciate some input on this approach.
This sounds awfully complicated, and the more complicated the design the more difficult it is to code and maintain.
Assuming you need to denormalize your data, I'd suggest just using the basic denormalization technique, but keep track of which objects are being updated. If a client requests an object which is being updated, you know you need to query the database to get the updated data; if not, you can rely on the denormalized data. Once the task queue finishes, it can remove the object from the "being updated" list, and everything can rely on the denormalized data.
A sophisticated version could even track when each object was edited, so a given object would know if it had already been updated by the task queue.
It sounds like you are re-implementing Materialized Views: http://en.wikipedia.org/wiki/Materialized_view
I suggest the easy solution with Memcache. Upon an update from your client, you could save an entry in Memcache storing the key of the updated entity with the status 'updating'. When your task finishes, it deletes the Memcache entry. You would then check the status before a read, allowing the user to be correctly informed if the entity is still 'locked'.
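A minimal sketch of that flag with the App Engine Memcache API (the key prefix and the expiration are assumptions; the expiration guards against a task that dies without cleaning up):

import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.memcache.Expiration;
import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;

class DenormalizationStatus {
    private final MemcacheService cache = MemcacheServiceFactory.getMemcacheService();

    // On update: mark the entity as having denormalization in flight.
    void markUpdating(Key entityKey) {
        cache.put("updating:" + entityKey, "updating", Expiration.byDeltaSeconds(300));
    }

    // In the task queue worker, after denormalization completes.
    void markDone(Key entityKey) {
        cache.delete("updating:" + entityKey);
    }

    // Before a read: decide whether the denormalized copy can be trusted.
    boolean isLocked(Key entityKey) {
        return cache.get("updating:" + entityKey) != null;
    }
}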
Multiple clients are concurrently accessing a JAX-WS web service running on Glassfish or some other application server. Persistence is provided by something like Hibernate or OpenJPA. The database is Microsoft SQL Server 2005.
The service takes a few input parameters, some "magic" occurs, and then returns what is basically a transformed version of the next available value in a sequence, with the particular sequence and transformation being determined by the inputs. The "magic" that performs the transformation depends on the input parameters and various database tables (describing the relationship between the input parameters, the transformation, the sequence to get the next base value from, and the list of already served values for a particular sequence). Not sure if this could all be wrapped up in a stored procedure (probably), but also not sure if the client wants it there.
What is the best way to ensure consistency (i.e. each value is unique and values are consumed in order, with no opportunity for a value to reach a client without also being stored in the database) while maintaining performance?
It's hard to provide a complete answer without a full description (table schemas, etc.), but giving my best guess here as to how it works, I would say that you need a transaction around your "magic", which marks the next value in the sequence as in use before returning it. If you want to reuse sequence numbers then you can later unflag them (for example, if the user then cancels what they're doing) or you can just consider them lost.
One warning is that you want your transaction to be as short and as fast as possible, especially if this is a high-throughput system. Otherwise your sequence tables could quickly become a bottleneck. Analyze the process and see what the shortest transaction window is that will still allow you to ensure that a sequence isn't reused and use that.
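A rough sketch of such a transaction over JDBC (the sequence_values table and its columns are assumptions; UPDLOCK locks the row we read, and READPAST lets concurrent callers skip rows that are already locked instead of blocking on them):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Claim the next unused value of a sequence inside one short transaction.
long claimNextValue(Connection conn, int sequenceId) throws SQLException {
    conn.setAutoCommit(false);
    try {
        PreparedStatement select = conn.prepareStatement(
                "SELECT TOP 1 value FROM sequence_values WITH (UPDLOCK, READPAST) " +
                "WHERE sequence_id = ? AND in_use = 0 ORDER BY value");
        select.setInt(1, sequenceId);
        ResultSet rs = select.executeQuery();
        if (!rs.next()) {
            throw new SQLException("sequence exhausted");
        }
        long value = rs.getLong(1);

        // Mark the value as consumed before it ever leaves the transaction.
        PreparedStatement mark = conn.prepareStatement(
                "UPDATE sequence_values SET in_use = 1 WHERE sequence_id = ? AND value = ?");
        mark.setInt(1, sequenceId);
        mark.setLong(2, value);
        mark.executeUpdate();

        conn.commit(); // keep the window as short as possible
        return value;
    } catch (SQLException e) {
        conn.rollback();
        throw e;
    }
}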
It sounds like you have most of the elements you need here. One thing that might pose difficulty, depending on how you've implemented your service, is that you don't want to write any response to the browser until your database transaction has been safely committed without errors.
A lot of web frameworks keep the persistence session open (and uncommitted) until the response has been rendered to support lazy loading of persistent objects by the view. If that's true in your case, you'll need to make sure that none of that rendered view is delivered to the client until you're sure it's committed.
One approach is a Servlet Filter that buffers output from the servlet or web service framework that you're using until it's completed its work.
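A sketch of such a filter against the Servlet 2.5 API (class names are assumptions; all output is captured in memory and released only after the chain, and therefore the transaction, has completed without error):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletOutputStream;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpServletResponseWrapper;

public class BufferingFilter implements Filter {
    public void init(FilterConfig config) {}
    public void destroy() {}

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        HttpServletResponse response = (HttpServletResponse) res;

        // Wrap the response so every write lands in the in-memory buffer.
        HttpServletResponseWrapper wrapper = new HttpServletResponseWrapper(response) {
            private final PrintWriter writer = new PrintWriter(buffer);
            public PrintWriter getWriter() { return writer; }
            public ServletOutputStream getOutputStream() {
                return new ServletOutputStream() {
                    public void write(int b) { buffer.write(b); }
                };
            }
        };

        // The service (and its transaction) runs to completion here; if it
        // throws, nothing has been sent to the client yet.
        chain.doFilter(req, wrapper);
        wrapper.getWriter().flush();

        // Only now, after a successful commit, release the buffered bytes.
        response.getOutputStream().write(buffer.toByteArray());
    }
}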