how to deal with race conditions in a RESTful application? - java

Here is what's happening in my RESTful web app:
HTTP request comes in
The app starts to build a response, with some initial portion of data
Another request changes the data that were used in step 2
The first request realizes that its data are now stale
What should it do? Fail the request and return an error to the client? Or should it start from scratch (taking more time than the client expects)?

IMHO you should treat a REST request very close to how you treat a DB transaction:
Either make sure you lock what you need to lock before doing the real work
Or prepare to fail/retry on a concurrency issue
Very often this can actually be handed off to a DB transaction - depending on how much and what non-DB work your request does.

I think a good starting point is the concurrency model used by CouchDB. In essence:
Each request is handled in isolation, i.e. is not affected by other concurrent requests. This implies that you need to be able to get a consistent snapshot of the database when you begin processing a request, which most DBMS systems support with some notion of transaction.
GET requests always succeed, and return the state of the system at the point when they were submitted, ignoring any subsequent updates.
GET requests return a revision ID for the resource in question, which must be included as a field in any subsequent PUT request.
In a PUT request, the submitted revision ID is checked against the latest revision ID in the database. If they don't match then an error code is returned, in which case the client must re-fetch the latest version and re-apply any changes that they made.
More reading:
http://wiki.apache.org/couchdb/Technical%20Overview#ACID_Properties
http://wiki.apache.org/couchdb/HTTP_Document_API#PUT
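Below is a minimal sketch of that revision-check flow in JAX-RS, using the ETag/If-Match headers to carry the revision ID and an in-memory map as a stand-in for the database (CouchDB itself uses a _rev field and its own API; the class and field names here are made up for illustration):

import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import javax.ws.rs.*;
import javax.ws.rs.core.*;

@Path("/documents/{id}")
public class DocumentResource {

    // id -> { revision, body }; an in-memory stand-in for the real data store
    private static final ConcurrentHashMap<String, String[]> DB = new ConcurrentHashMap<>();

    @GET
    @Produces(MediaType.TEXT_PLAIN)
    public Response get(@PathParam("id") String id) {
        String[] rec = DB.get(id);
        if (rec == null) {
            return Response.status(Response.Status.NOT_FOUND).build();
        }
        // Expose the revision id so the client can echo it back on the next PUT
        return Response.ok(rec[1]).tag(new EntityTag(rec[0])).build();
    }

    @PUT
    @Consumes(MediaType.TEXT_PLAIN)
    public Response put(@PathParam("id") String id,
                        @HeaderParam(HttpHeaders.IF_MATCH) String clientRevision,
                        String newBody) {
        String[] rec = DB.get(id);
        String currentRevision = (rec == null) ? null : rec[0];
        if (currentRevision != null && !currentRevision.equals(stripQuotes(clientRevision))) {
            // Stale revision: the client must re-fetch and re-apply its changes
            return Response.status(Response.Status.CONFLICT).build();
        }
        // A real implementation would do this check-and-write atomically in a transaction
        String nextRevision = UUID.randomUUID().toString();
        DB.put(id, new String[] { nextRevision, newBody });
        return Response.noContent().tag(new EntityTag(nextRevision)).build();
    }

    private static String stripQuotes(String etag) {
        return etag == null ? null : etag.replace("\"", "");
    }
}

A client that receives the 409 simply re-GETs the document, re-applies its change on top of the latest revision, and PUTs again with the new revision ID.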

Assuming it's not about DB transactions and, say, distributed long-running processes are involved in each step:
In this scenario the client should be sent an appropriate response (something like a 409/410 HTTP status code) with details indicating that the request is no longer valid and that the client should try again. Blind retrying could end up in loops or, in the worst case, do something the client did not intend.
For example, when you book a hotel/ticket online you get a response saying the price has changed since your quote and you need to submit again to buy at the new price.
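A tiny hedged sketch of what such a 409 response could look like in JAX-RS; the booking/price details are invented for the example:

import javax.ws.rs.core.Response;

public class BookingConflicts {

    /** Build a 409 Conflict response that tells the client what changed and what to do next. */
    public static Response priceChanged(double quotedPrice, double currentPrice) {
        String detail = String.format(
                "Price changed from %.2f to %.2f since your quote; re-submit to buy at the new price.",
                quotedPrice, currentPrice);
        return Response.status(Response.Status.CONFLICT)   // 409
                       .entity(detail)
                       .build();
    }
}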

From my point of view your question is essentially the same as:
"If I try to do a read from the database, and another transaction tries to do a write, it will block. But when I finish my read, I will have missed the new data that will be populated by the new transaction that comes in after my read."
This is a bad way to think about it. You should make sure that clients get consistent data in the responses. If the data have been updated by the time they get the response, that is not a problem of the original request.
Your worry is that the data are being updated right now and you happen to know about it. But what if the data are updated right after the response goes out over the network?
IMHO choose the simplest solution that fits your requirements.
The clients should "poll" more frequently to make sure they always have the most recent copy of the data

Well, strictly speaking a race condition is a bug. The way to eliminate a race condition is to not have shared data. If that cannot be avoided for a given use case, then a first-come, first-served basis is usually helpful:
The first request locks the shared data and the second waits until the first is done with it.
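A minimal sketch of that first-come, first-served idea with one fair lock per resource id; note this only serializes requests within a single JVM, and the class and method names are illustrative:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

public class ResourceLocks {

    // One fair (FIFO) lock per resource id, so requests are served first come, first served
    private static final ConcurrentHashMap<String, ReentrantLock> LOCKS = new ConcurrentHashMap<>();

    public static <T> T withLock(String resourceId, Supplier<T> work) {
        ReentrantLock lock = LOCKS.computeIfAbsent(resourceId, id -> new ReentrantLock(true));
        lock.lock();             // the second request blocks here until the first is done
        try {
            return work.get();
        } finally {
            lock.unlock();
        }
    }
}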

Related

A good strategy/solution for concurrency with hibernate's optimistic/pessimistic locking

Let's presume that we have an application "mail client" and a front-end for it.
If a user is typing a message or editing the subject or whatever, a REST call is made to update whatever the user was changing (e.g. the receivers) to keep the message in DRAFT. So a lot of PUTs are happening to save the message. When closing the window, an update of every editable field happens at the same time. Hibernate can't handle this concurrency: each of those calls retrieves the message, edits its own fields and tries to save the message again, while another call has already changed it.
I know I can add a rest call to save all fields at the same time, but I was wondering if there is a cleaner solution, or a decent strategy to handle such cases (like for example only updating one field or some merge strategy if the object has already changed)
Thanks in advance!
The easiest solutions here would be to tweak the UI to either
Submit a single rest call during email submission that does all the tasks necessary.
Serialize the rest calls so they're chained rather than firing concurrently.
The concern I have here is that this will snowball at some point and become a bigger concurrency problem as more users interact with the application. Consider for a moment the number of concurrent REST calls your web infrastructure alone will have to support when you're faced with 100, 500, 1000, or even 10000 or more concurrent users.
Does it really make sense to beef up the volume of servers to handle that load when the load itself is a product of a design flaw in the first place?
Hibernate is designed to handle locking through two mechanisms, optimistic and pessimistic.
Optimistic Way
Read the entity from the data store.
Cache a copy of the fields you're going to modify in temporary variables.
Modify the field or fields based on your PUT operation.
Attempt to merge the changes.
If save succeeds, you're done.
Should an OptimisticLockException occur, refresh the entity state from data store.
Compare cached values to the fields you must change.
If the values differ, you can assert or throw an exception.
If they don't differ, go back to 4.
The beautiful part of the optimistic approach is you avoid any form of deadlock happening, particularly if you're allowing multiple tables to be read and locked separately.
While you can use pessimistic lock options, optimistic locking is generally the accepted way to handle concurrent operations, as it has the least contention and performance impact.
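A hedged sketch of that optimistic flow using JPA's @Version support (which Hibernate implements); the Message entity, the persistence unit name "mail", and the retry count are assumptions for illustration, not your actual model:

import javax.persistence.*;

@Entity
class Message {
    @Id private Long id;
    @Version private long version;   // Hibernate checks and bumps this on every update
    private String subject;
    private String receivers;

    void setSubject(String subject) { this.subject = subject; }
}

public class MessageService {

    private final EntityManagerFactory emf = Persistence.createEntityManagerFactory("mail");

    /** Update only the subject, retrying if a concurrent PUT changed the row first. */
    public void updateSubject(Long id, String newSubject) {
        for (int attempt = 0; attempt < 3; attempt++) {
            EntityManager em = emf.createEntityManager();
            try {
                em.getTransaction().begin();
                Message m = em.find(Message.class, id);   // fresh state on every attempt
                m.setSubject(newSubject);                 // touch only the field this PUT owns
                em.getTransaction().commit();             // the version check happens here
                return;
            } catch (OptimisticLockException | RollbackException e) {
                if (em.getTransaction().isActive()) {
                    em.getTransaction().rollback();       // another call won; re-read and retry
                }
            } finally {
                em.close();
            }
        }
        throw new IllegalStateException("Could not update subject after 3 attempts");
    }
}

The key point is that each PUT re-reads the entity and only touches its own field before saving, so concurrent saves of different fields no longer overwrite each other.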

Optimizing async multi-request operations calling the same service

We are developing a document management web application and right now we are thinking about how to tackle actions on multiple documents. For example, let's say a user multi-selects 100 documents and wants to delete all of them. Until now (where we did not support multiple selection) the deleteDoc action does an ajax request to a deleteDocument service according to docId. The service in turn calls the corresponding utility function which does the required permission checking and proceeds to delete the document from the database. When it comes to multiple deletion we are not sure what the best way to proceed is. We have come up with several solutions but do not know which one is best (practice), and I'm looking for advice. Mind you, we are keen on keeping the back end code as intact as possible:
Creating a new multipleDeleteDocument service which calls the single doc delete utility function a number of times according to the amount of documents we want to delete (ugly in my opinion and counter-intuitive with modern practices).
Keep the back end code as is and instead, for every document, make an ajax request on the service.
Somehow (I have no idea if this is even possible) batch the requests into one but still have the server execute the deleteDocument service X amount of times.
Use WebSockets for the multi-delete action essentially cutting down on the communication overhead and time. Our application generally runs over lan networks with low latency which is optimal for websockets (when latency is introduced web sockets tend to match http request speeds).
Something we haven't thought of?
Sending N Ajax calls or N webSocket messages when all the data could be combined into a single call or message is never the most optimal solution, so options 2 and 4 are certainly not ideal. I see no particular reason to use a webSocket over an Ajax call. If you already have a webSocket connection, then you can certainly just send a single delete message with a list of document IDs over the webSocket, but an Ajax call could work just as well, so I wouldn't create a webSocket connection just for this purpose.
Options 1 and 3 both require a new service endpoint that lets you make a single call to delete multiple documents. This would be recommended.
If I were designing an API like this, I'd design a single delete endpoint that takes one or more document IDs. That way the same API call can be used whether deleting a single document or multiple documents.
Then, from the client anytime you have multiple documents to delete, always collect them together and make one API call to delete all of them at once.
Internal to the server, how you implement that API depends upon your data store. If your data store also permits sending multiple documents to delete, then you would likewise call the data store that way. If it only supports single deletes, then you would just loop and delete each one individually.
Doing option 3 would be the most elegant solution for me.
Assuming you send requests like POST /deleteDocument where you have docId as a parameter, you could instead pass an array of document ids to remove.
Then in the back end you would only have to iterate through the list of ids and perform each deletion. You should be able to keep the deletion code relatively intact.
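A hedged JAX-RS sketch of such a bulk endpoint; the DocumentService name and its deleteDocument method stand in for your existing single-delete utility and are assumptions:

import java.util.List;
import javax.ws.rs.*;
import javax.ws.rs.core.*;

@Path("/documents")
public class DocumentBulkResource {

    private final DocumentService documents = new DocumentService();

    @POST
    @Path("/delete")
    @Consumes(MediaType.APPLICATION_JSON)
    public Response deleteMany(List<Long> docIds) {
        for (Long id : docIds) {
            // Reuse the existing per-document permission check + delete untouched
            documents.deleteDocument(id);
        }
        return Response.noContent().build();
    }
}

// Stand-in for the existing back-end utility, shown only so the sketch is self-contained
class DocumentService {
    void deleteDocument(Long docId) {
        /* permission check + database delete, exactly as before */
    }
}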

Identifying duplicate requests fired from back end

I am facing a use case where I need to track down duplicate requests, which are fired through REST API calls from back end. Each request writes into the database, and hence the duplicate requests need not be processed again.
The duplicate requests may come in different threads under the same VM, or may be under different VMs altogether. The problem is: how do I identify these duplicate requests?
Approaches that I can think of :
Before processing an incoming request, check in the database whether its outcome is already present, i.e. processing the request again would change nothing. If so, ignore the request; otherwise process it.
For every incoming request that has been processed, store it in serialized form in the DB, mapped to a value (something like a hash index). Then, for every incoming request, check whether the DB already has that request. If yes, ignore it; otherwise process it.
But both require db read operations. Can I do better ?
I don't think you can avoid DB operations in this case.
Your first approach is very project-specific one.
The second approach also cannot be applied to arbitrary code, because there might be cases where users send several identical requests and all of them have to be processed.
A more general approach would be for the server to issue tokens, which are then passed with every request by the client to the server. When processing a request, the server checks whether the token passed in the request has already been used. If not, it marks the token as used in the DB and processes the request. Otherwise it ignores the request or sends an error.
A client can obtain such a token by querying a server method (in this case there is no need to check any tokens for this particular request), or optionally the server can send a new token each time it responds to a query.
You should also make sure to clean up outdated tokens once in a while to avoid polluting the database and collisions when generating new ones, if you generate tokens randomly. (See Birthday paradox).
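A minimal sketch of the token approach backed by a unique-keyed table; the request_token table and its columns are assumptions, and the unique constraint is what makes "check and mark" atomic across threads and VMs:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.SQLIntegrityConstraintViolationException;
import java.util.UUID;

public class DuplicateRequestGuard {

    private final Connection db;

    public DuplicateRequestGuard(Connection db) {
        this.db = db;
    }

    /** Called when the client asks for a token to attach to its next request. */
    public String issueToken() {
        return UUID.randomUUID().toString();
    }

    /**
     * Returns true if this token has not been used yet and marks it as used.
     * The unique index on request_token.token turns "check and mark" into a single
     * atomic step, so duplicates from other threads or other VMs are rejected too.
     */
    public boolean tryConsume(String token) throws SQLException {
        try (PreparedStatement ps = db.prepareStatement(
                "INSERT INTO request_token (token, used_at) VALUES (?, CURRENT_TIMESTAMP)")) {
            ps.setString(1, token);
            ps.executeUpdate();
            return true;          // first use: go ahead and process the request
        } catch (SQLIntegrityConstraintViolationException duplicate) {
            return false;         // token already consumed: ignore the request or return an error
        }
    }
}

A filter or interceptor in front of the REST resources could call tryConsume() and short-circuit the request when it returns false, and a periodic job can purge rows older than the token lifetime.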
The "double submit" is a common problem with web development. With standard forms a common idiom is submit-redirect-get which avoids a lot of problems.
I assume you're using javascript to fire requests to a REST backend? A simple approach to prevent one user from duplicating a request is to use javascript to disable the button for a small period of time after it's clicked.
However if you have to prevent this for multiple users, it is highly dependent on your model and other project details.

Denormalization in Google App Engine?

Background::::
I'm working with Google App Engine (GAE) for Java. I'm struggling to design a data model that plays to Bigtable's strengths and works around its weaknesses; these are two previous related posts:
Database design - google app engine
Appointments and Line Items
I've tentatively decided on a fully normalized backbone with denormalized properties added into entities so that most client requests can be serviced with only one query.
I reason that a fully normalized backbone will:
Help maintain data integrity if I code a mistake in the denormalization
Enable writes in one operation from a client's perspective
Allow for any type of unanticipated query on the data (provided one is willing to wait)
While the denormalized data will:
Enable most client requests to be serviced very fast
Basic denormalization technique:::
I watched an app engine video describing a technique referred to as "fan-out." The idea is to make quick writes to normalized data and then use the task queue to finish up the denormalization behind the scenes without the client having to wait. I've included the video here for reference, but it's an hour long and there's no need to watch it in order to understand this question:
http://code.google.com/events/io/2010/sessions/high-throughput-data-pipelines-appengine.html
If I use this "fan-out" technique, every time the client modifies some data, the application would update the normalized model in one quick write and then fire off the denormalization instructions to the task queue so the client does not have to wait for them to complete as well.
Problem:::
The problem with using the task queue to update the denormalized version of the data is that the client could make a read request on data that they just modified before the task queue has completed the denormalization of that data. This would provide the client with stale data that is incongruent with their recent request, confusing the client and making the application appear buggy.
As a remedy, I propose fanning out denormalization operations in parallel via asynchronous calls to other URLs in the application via URLFetch: http://code.google.com/appengine/docs/java/urlfetch/ The application would wait until all of the asynchronous calls had completed before responding to the client request.
For example, suppose I have an "Appointment" entity and a "Customer" entity. Each appointment would include a denormalized copy of the information for the customer it is scheduled for. If a customer changed their first name, the application would make 30 asynchronous calls; one to each affected appointment resource in order to change the copy of the customer's first name in each one.
In theory, this could all be done in parallel. All of this information could be updated in roughly the time it takes to make 1 or 2 writes to the datastore. A timely response could be made to the client after the denormalization was completed eliminating the possibility of the client being exposed to incongruent data.
The biggest potential problem I see with this is that the application cannot have more than 10 asynchronous request calls in flight at any one time (documented here: http://code.google.com/appengine/docs/java/urlfetch/overview.html).
Proposed denormalization technique (recursive asynchronous fan-out):::
My proposed remedy is to send denormalization instructions to another resource that recursively splits the instructions into equal-sized smaller chunks, calling itself with the smaller chunks as parameters until the number of instructions in each chunk is small enough to be executed outright. For example, if a customer with 30 associated appointments changed the spelling of their first name, I'd call the denormalization resource with instructions to update all 30 appointments. It would then split those instructions up into 10 sets of 3 instructions and make 10 asynchronous requests to its own URL with each set of 3 instructions. Once an instruction set contained fewer than 10 instructions, the resource would make the asynchronous requests outright, one per instruction.
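A hedged sketch of just the splitting arithmetic described above; the dispatchAsync stub stands in for the actual asynchronous URLFetch/task call, and in the real design each recursive step would itself be a separate asynchronous request to the companion URL rather than an in-process call:

import java.util.ArrayList;
import java.util.List;

public class FanOut {

    private static final int MAX_PARALLEL = 10;   // App Engine's async URLFetch limit

    public static void denormalize(List<String> instructions) {
        if (instructions.size() <= MAX_PARALLEL) {
            for (String instruction : instructions) {
                dispatchAsync(instruction);        // one async call per instruction
            }
            return;
        }
        // Too many instructions: split into at most MAX_PARALLEL chunks and hand each chunk on
        int chunkSize = (instructions.size() + MAX_PARALLEL - 1) / MAX_PARALLEL;
        for (int start = 0; start < instructions.size(); start += chunkSize) {
            List<String> chunk = new ArrayList<>(
                    instructions.subList(start, Math.min(start + chunkSize, instructions.size())));
            denormalize(chunk);   // in the real design: an async request to the companion URL
        }
    }

    private static void dispatchAsync(String instruction) {
        // Placeholder for the actual asynchronous URLFetch call updating one appointment
    }
}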
My concerns with this approach are:
It could be interpreted as an attempt to circumvent App Engine's rules, which would cause problems. (It's not even allowed for a URL to call itself, so I'd in fact have to have two URL resources that handle the recursion by calling each other.)
It is complex with multiple points of potential failure.
I'd really appreciate some input on this approach.
This sounds awfully complicated, and the more complicated the design the more difficult it is to code and maintain.
Assuming you need to denormalize your data, I'd suggest just using the basic denormalization technique, but keep track of which objects are being updated. If a client requests an object which is being updated, you know you need to query the database to get the updated data; if not, you can rely on the denormalized data. Once the task queue finishes, it can remove the object from the "being updated" list, and everything can rely on the denormalized data.
A sophisticated version could even track when each object was edited, so a given object would know if it had already been updated by the task queue.
It sounds like you are re-implementing Materialized Views: http://en.wikipedia.org/wiki/Materialized_view.
I suggest the easy solution with Memcache. Upon an update from your client, you could save an entry in Memcache storing the key of the updated entity with the status 'updating'. When your task finishes, it deletes that Memcache entry. Then you check the status before a read, allowing the user to be correctly informed if the entity is still 'locked'.
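A hedged sketch of that Memcache flag on App Engine; the key prefix and the 5-minute expiry (a safety net in case the task never clears the flag) are choices made up for the example:

import com.google.appengine.api.memcache.Expiration;
import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;

public class DenormalizationStatus {

    private static final MemcacheService CACHE = MemcacheServiceFactory.getMemcacheService();

    /** Call when the client's write is accepted and the fan-out task is enqueued. */
    public static void markUpdating(String entityKey) {
        // Expire automatically so a crashed task can't leave the entity 'locked' forever
        CACHE.put("updating:" + entityKey, Boolean.TRUE, Expiration.byDeltaSeconds(300));
    }

    /** Call from the task queue handler once denormalization has finished. */
    public static void markDone(String entityKey) {
        CACHE.delete("updating:" + entityKey);
    }

    /** Call before serving a read from the denormalized copies. */
    public static boolean isUpdating(String entityKey) {
        return CACHE.get("updating:" + entityKey) != null;
    }
}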

Best performing way to guarantee data consistency between concurrent web service calls?

Multiple clients are concurrently accessing a JAX-WS web service running on Glassfish or some other application server. Persistence is provided by something like Hibernate or OpenJPA. The database is Microsoft SQL Server 2005.
The service takes a few input parameters, some "magic" occurs, and then returns what is basically a transformed version of the next available value in a sequence, with the particular sequence and transformation being determined by the inputs. The "magic" that performs the transformation depends on the input parameters and various database tables (describing the relationship between the input parameters, the transformation, the sequence to get the next base value from, and the list of already served values for a particular sequence). Not sure if this could all be wrapped up in a stored procedure (probably), but also not sure if the client wants it there.
What is the best way to ensure consistency (i.e. each value is unique and values are consumed in order, with no opportunity for a value to reach a client without also being stored in the database) while maintaining performance?
It's hard to provide a complete answer without a full description (table schemas, etc.), but giving my best guess here as to how it works, I would say that you need a transaction around your "magic", which marks the next value in the sequence as in use before returning it. If you want to reuse sequence numbers then you can later unflag them (for example, if the user then cancels what they're doing) or you can just consider them lost.
One warning is that you want your transaction to be as short and as fast as possible, especially if this is a high-throughput system. Otherwise your sequence tables could quickly become a bottleneck. Analyze the process and see what the shortest transaction window is that will still allow you to ensure that a sequence isn't reused and use that.
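A hedged JPA sketch of that short transaction, using a pessimistic row lock so concurrent callers queue up on the sequence row instead of receiving duplicates; the SequenceEntry entity and the persistence unit name are assumptions about your schema:

import javax.persistence.*;

@Entity
class SequenceEntry {
    @Id private String sequenceName;
    private long nextValue;

    long take() { return nextValue++; }   // hand out the current value and advance
}

public class SequenceService {

    private final EntityManagerFactory emf = Persistence.createEntityManagerFactory("sequences");

    /** Reserve the next value for the given sequence inside one short transaction. */
    public long nextValue(String sequenceName) {
        EntityManager em = emf.createEntityManager();
        try {
            em.getTransaction().begin();
            // PESSIMISTIC_WRITE locks just this one row, so concurrent callers queue here
            // instead of being handed the same value
            SequenceEntry seq = em.find(SequenceEntry.class, sequenceName,
                                        LockModeType.PESSIMISTIC_WRITE);
            long value = seq.take();
            em.getTransaction().commit();  // the value is recorded as consumed before it is returned
            return value;
        } finally {
            if (em.getTransaction().isActive()) {
                em.getTransaction().rollback();
            }
            em.close();
        }
    }
}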
It sounds like you have most of the elements you need here. One thing that might pose difficulty, depending on how you've implemented your service, is that you don't want to write any response to the browser until your database transaction has been safely committed without errors.
A lot of web frameworks keep the persistence session open (and uncommitted) until the response has been rendered to support lazy loading of persistent objects by the view. If that's true in your case, you'll need to make sure that none of that rendered view is delivered to the client until you're sure it's committed.
One approach is a Servlet Filter that buffers output from the servlet or web service framework that you're using until it's completed its work.
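A hedged sketch of such a filter; it only wraps getWriter() for brevity (a complete version would also wrap getOutputStream() for binary output), and the names are illustrative:

import java.io.CharArrayWriter;
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.*;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpServletResponseWrapper;

public class BufferedResponseFilter implements Filter {

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletResponse response = (HttpServletResponse) res;
        CharArrayWriter buffer = new CharArrayWriter();

        // Wrapper that captures everything the downstream servlet/framework writes
        HttpServletResponseWrapper wrapper = new HttpServletResponseWrapper(response) {
            private final PrintWriter writer = new PrintWriter(buffer);
            @Override public PrintWriter getWriter() { return writer; }
        };

        chain.doFilter(req, wrapper);   // rendering and the transaction commit happen in here
        wrapper.getWriter().flush();

        // Nothing reaches the client until this point; if doFilter() threw instead,
        // the buffered (possibly inconsistent) body is simply discarded.
        response.getWriter().write(buffer.toCharArray());
    }

    @Override public void init(FilterConfig config) { }
    @Override public void destroy() { }
}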
