Denormalization in Google App Engine?

Background:
I'm working with Google App Engine (GAE) for Java and struggling to design a data model that plays to Bigtable's strengths and weaknesses. These are two previous related posts:
Database design - google app engine
Appointments and Line Items
I've tentatively decided on a fully normalized backbone with denormalized properties added into entities so that most client requests can be serviced with only one query.
I reason that a fully normalized backbone will:
Help maintain data integrity if I code a mistake in the denormalization
Enable writes in one operation from a client's perspective
Allow for any type of unanticipated query on the data (provided one is willing to wait)
While the denormalized data will:
Enable most client requests to be serviced very fast
Basic denormalization technique:
I watched an App Engine video describing a technique referred to as "fan-out." The idea is to make quick writes to normalized data and then use the task queue to finish the denormalization behind the scenes without the client having to wait. I've included the video here for reference, but it's an hour long and there's no need to watch it in order to understand this question:
http://code.google.com/events/io/2010/sessions/high-throughput-data-pipelines-appengine.html
If I use this "fan-out" technique, every time the client modifies some data, the application would update the normalized model in one quick write and then fire off the denormalization instructions to the task queue so the client does not have to wait for them to complete as well.
Problem:
The problem with using the task queue to update the denormalized version of the data is that the client could make a read request on data that they just modified before the task queue has completed the denormalization on that data. This would provide the client with stale data that is incongruent with their recent request, confusing the client and making the application appear buggy.
As a remedy, I propose fanning out denormalization operations in parallel via asynchronous URLFetch calls to other URLs in the application: http://code.google.com/appengine/docs/java/urlfetch/ The application would wait until all of the asynchronous calls had completed before responding to the client request.
For example, suppose I have an "Appointment" entity and a "Customer" entity. Each appointment would include a denormalized copy of the information for the customer it is scheduled for. If a customer with 30 appointments changed their first name, the application would make 30 asynchronous calls, one to each affected appointment resource, to change the copy of the customer's first name in each one.
In theory, this could all be done in parallel. All of this information could be updated in roughly the time it takes to make 1 or 2 writes to the datastore. A timely response could be made to the client after the denormalization was completed eliminating the possibility of the client being exposed to incongruent data.
The biggest potential problem I see with this is that the application cannot have more than 10 asynchronous request calls in flight at any one time (documented here: http://code.google.com/appengine/docs/java/urlfetch/overview.html).
Proposed denormalization technique (recursive asynchronous fan-out):
My proposed remedy is to send denormalization instructions to another resource that recursively splits the instructions into equal-sized smaller chunks, calling itself with the smaller chunks as parameters until the number of instructions in each chunk is small enough to be executed outright. For example, if a customer with 30 associated appointments changed the spelling of their first name, I'd call the denormalization resource with instructions to update all 30 appointments. It would then split those instructions into 10 sets of 3 and make 10 asynchronous requests to its own URL, one per set. Once an instruction set contained fewer than 10 instructions, the resource would make the asynchronous requests outright, one per instruction.
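The chunking step described above can be sketched roughly as follows. This is only an illustration in plain Java: the class name `RecursiveFanOut`, the `split` and `fanOut` methods, and the choice to treat a chunk of at most 10 instructions as "small enough" are all assumptions, and the recursive calls run sequentially here where the real design would issue asynchronous URLFetch requests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class RecursiveFanOut {
    // Assumed limit, taken from the 10 concurrent calls mentioned above.
    static final int MAX_PARALLEL = 10;

    // Splits items into at most `parts` contiguous chunks of near-equal size.
    static <T> List<List<T>> split(List<T> items, int parts) {
        List<List<T>> chunks = new ArrayList<>();
        int n = items.size();
        int base = n / parts;
        int extra = n % parts;
        int start = 0;
        for (int i = 0; i < parts && start < n; i++) {
            int size = base + (i < extra ? 1 : 0);
            chunks.add(new ArrayList<>(items.subList(start, start + size)));
            start += size;
        }
        return chunks;
    }

    // Applies every instruction and returns the recursion depth that was needed.
    // In the real design each recursive call would be an asynchronous request.
    static <T> int fanOut(List<T> instructions, Consumer<T> apply) {
        if (instructions.size() <= MAX_PARALLEL) {
            instructions.forEach(apply);
            return 1;
        }
        int depth = 0;
        for (List<T> chunk : split(instructions, MAX_PARALLEL)) {
            depth = Math.max(depth, fanOut(chunk, apply));
        }
        return depth + 1;
    }
}
```

With 30 instructions this yields 10 chunks of 3 and a depth of 2, matching the example in the paragraph above. Since each extra level multiplies the reach by 10, the depth grows only logarithmically with the number of instructions.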
My concerns with this approach are:
It could be interpreted as an attempt to circumvent App Engine's rules, which would cause problems. (It's not even allowed for a URL to call itself, so I'd in fact need two URL resources that handle the recursion by calling each other.)
It is complex with multiple points of potential failure.
I'd really appreciate some input on this approach.

This sounds awfully complicated, and the more complicated the design the more difficult it is to code and maintain.
Assuming you need to denormalize your data, I'd suggest just using the basic denormalization technique, but keep track of which objects are being updated. If a client requests an object which is being updated, you know you need to query the database to get the updated data; if not, you can rely on the denormalized data. Once the task queue finishes, it can remove the object from the "being updated" list, and everything can rely on the denormalized data.
A sophisticated version could even track when each object was edited, so a given object would know if it had already been updated by the task queue.
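A minimal sketch of the "sophisticated version", assuming a per-entity version counter rather than wall-clock timestamps. The class and method names are made up for illustration, and the in-memory maps stand in for whatever shared store (datastore, Memcache) the application would actually use across instances.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class PendingUpdateTracker {
    // Latest version written to the normalized data, per entity key.
    private final Map<String, Long> latestVersion = new ConcurrentHashMap<>();
    // Version that the denormalized copy has caught up to, per entity key.
    private final Map<String, Long> denormalizedVersion = new ConcurrentHashMap<>();

    // Called when the client's write to the normalized data succeeds.
    long recordWrite(String key) {
        return latestVersion.merge(key, 1L, Long::sum);
    }

    // Called by the background task once it has denormalized up to `version`.
    void recordDenormalized(String key, long version) {
        denormalizedVersion.merge(key, version, (a, b) -> Math.max(a, b));
    }

    // True while the denormalized copy lags behind the normalized data,
    // i.e. a read should fall back to the normalized entities.
    boolean isStale(String key) {
        long latest = latestVersion.getOrDefault(key, 0L);
        long done = denormalizedVersion.getOrDefault(key, 0L);
        return done < latest;
    }
}
```

A read handler would call `isStale(key)` first and query the normalized data only when it returns true. Because versions are compared rather than a simple flag being cleared, a task that finishes an older update does not hide a newer one that is still pending.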

It sounds like you are re-implementing materialized views: http://en.wikipedia.org/wiki/Materialized_view

I suggest the easy solution with Memcache. Upon an update from your client, you could save an entry in Memcache that stores the key of the updated entity with the status 'updating'. When your task finishes, it deletes that Memcache status. You would then check the status before a read, allowing the user to be correctly informed if the entity is still 'locked'.

Related

How to deal with multiple database results from different servers for a request

I have cloud statistics (structured data, CSV) which I have to expose to administrators and users.
For scalability, the data will be collected by multiple machines (perf monitors), each connected to its own DB.
The Manager (Mgr) is responsible for multicasting each request to all perf monitors and collecting the overall stats data to satisfy a single UI request.
So questions are:
1) How do I sort the data from multiple monitors at the Mgr according to
the client request? Each monitor may return its result as per the client
request, but how do I merge the data from multiple machines in Java?
That is, how do I perform in-memory SQL aggregate/scalar functions (e.g. GROUP BY, ORDER BY, AVG) on all the results retrieved from multiple clusters at the Mgr? How do I implement DB SQL aggregate/scalar functionality on the Java side? Are there any known APIs?
I think what I need is the Reduce part of the MapReduce technique in Hadoop.
2) A request from the UI (assume select count(*) from DB where Memory >
1000MB) has to be forwarded to multiple machines. How do I send parallel
requests to the individual monitors and consume the results only when all the nodes
have responded? That is, how do I make the user thread wait until all the
responses from the perf monitors have been consumed? How do I trigger parallel REST requests for a single UI request on the Mgr?
3) Do I have to authenticate the UI user at both the Mgr and the perf monitors?
4) Do you see any drawbacks in this approach?
Notes:
1) I didn't go for NoSQL because the data is structured and no joins are required.
2) I didn't go for node.js since I am new to it and developing with it may take more time. Also, I am not developing anything concurrency-critical where a single-threaded model is best suited. Here data is only pushed and retrieved; no modification happens.
3) I want an individual DB for each monitor, or at least two DB instances with multiple clusters per instance, to support faster access to real-time big statistical data.

You want to scale your app, but you designed an inherent bottleneck. Namely: the Mgr.
What I would do is split the Mgr into at least two parts: front end and back end. The front end could simply be an aggregator and/or controller which collects the requests from all the different UI servers, timestamps them, and puts them in a queue (RabbitMQ, Kafka, Redis, whatever), creating a message with the UI session ID or something similar that uniquely identifies the source of the request. Then you just have to wait until you get a response on the queue (on a different topic, of course).
Then on your back end (the other side of the queue) you can set up as many nodes as your load requires and have them all perform the same task, namely pulling requests off the queue and calling the performance monitoring APIs as necessary. You can scale these back-end nodes as much as you wish since they don't hold any state; all the state that needs to be stored is already part of the messages in the queue, which will be automagically persisted for you by Redis/Kafka/RabbitMQ or whatever else you choose.
You can also use Apache Storm or something similar to do this for you in the back end, since it was designed for exactly this kind of application.
Apache Storm also has built-in merging capability exposed through the Trident API.
Note on the authentication: you should authenticate the HTTP requests on the front-end side and then you will be all right. Just assign unique IDs (session IDs most probably) to the users connected to your mgr and use this internal ID when you forward your requests further to downstream servers.
How do I send parallel requests to the individual monitors and consume
the results only when all the nodes have responded? That is, how do I make
the user thread wait until all the responses from the perf monitors have been consumed?
How do I trigger parallel REST requests for a single UI request on the Mgr?
Well, if you have so many questions about handling user connections and serving those clients with responses, then I would suggest picking up a book on the Java Servlet API. You might want to read this one, for example: Servlet & JSP: A Tutorial (A Tutorial series). It is a bit outdated but well written.
But with all due respect, if you have so many questions on these quite fundamental topics, then it might be better to leave the architecture design to someone more experienced.

Don't reinvent the wheel; use one of the good existing BAM and database monitoring tools. They have a lot of built-in dashboards and statistics and are easy to connect with Java and workflows.

For scalability, the data will be collected by multiple
machines (perf monitors), each connected to its own DB.
Approximately what sort of scaling do you anticipate: hundreds of GB, multiple terabytes? I ask because these days SQL Server and Oracle can handle really large volumes of data. Once the data is collected in a central DB, it's game over as far as searching and crunching are concerned.
The Manager (Mgr) is responsible for multicasting each request to all
perf monitors and collecting the overall stats data to satisfy a single UI
request.
Writing this will be a major task and it will be really complex, IMHO. That said, I am not an expert in this area.

What I would do is put a layer of Hazelcast or Infinispan or something like this in your Performance Monitor instead of the Hazelcast. The Performance Monitor logic itself can be part of the data grid. MySQL will then work as persistent storage for this data grid. In this sense you can have more than one MySQL instance, and each will hold just a portion of the data. It simply extends your capacity beyond your maximum RAM. Over time, as you scale your performance monitors, you will also scale your persistence capabilities.
Using MapReduce or other distributed functions for aggregation can then lead to a massive amount of parallelism and the ability to serve significantly more requests. Such an architecture also scales horizontally. At the end it should look something like this: [diagram omitted]
And just another note: in general it is not necessary to have one MySQL for each Hazelcast node. That depends on what the goal is. I also left the Manager out of the diagram, but things there are simple: it can either work as a gateway to the data grid or be merged with the grid.

Not sure if my answer will be useful to you since this question was posted some time back.
I would like to answer it based on your question, the problems in the current approach, and a proposed solution.
1) How do I sort the data from multiple monitors at the Mgr according to the
client request? Each monitor may return its result as per the
client request, but how do I merge the data from multiple machines in
Java? That is, how do I perform in-memory SQL aggregate/scalar functions (e.g.
GROUP BY, ORDER BY, AVG) on all the results retrieved from
multiple clusters at the Mgr? How do I implement DB SQL aggregate/scalar
functionality on the Java side? Are there any known APIs? I think what I need is
the Reduce part of the MapReduce technique in Hadoop.
The JDK has shipped a built-in database, Java DB, which is also available as the Apache Derby database. It can be used as an in-memory SQL database. Java DB and Apache Derby can also store the data on disk, so you won't lose it after a restart.
Check here http://www.oracle.com/technetwork/java/javadb/overview/index.html https://db.apache.org/derby/
For MapReduce, a simple approach based on Java collections would work. I don't think you need any special MapReduce framework in this case. You should, however, consider out-of-memory conditions, network bandwidth, etc. when you read data from multiple sources.
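As a rough illustration of the collection-based approach, the sketch below merges result lists from several monitors with the Java Streams API. The `Row` shape (a `host` name and a `memoryMb` value) is hypothetical, since the real CSV columns are not given in the question, and the whole thing assumes the combined results fit in memory.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class StatsMerger {
    // Hypothetical row shape; the real CSV columns are not given in the question.
    static final class Row {
        final String host;
        final double memoryMb;
        Row(String host, double memoryMb) {
            this.host = host;
            this.memoryMb = memoryMb;
        }
    }

    // Roughly: SELECT host, AVG(memoryMb) ... GROUP BY host ORDER BY host,
    // computed over the rows returned by all monitors.
    static Map<String, Double> averageMemoryByHost(List<List<Row>> perMonitorResults) {
        return perMonitorResults.stream()
                .flatMap(List::stream)
                .collect(Collectors.groupingBy(r -> r.host, TreeMap::new,
                        Collectors.averagingDouble(r -> r.memoryMb)));
    }

    // Roughly: SELECT COUNT(*) ... WHERE memoryMb > threshold, across all monitors.
    static long countAbove(List<List<Row>> perMonitorResults, double thresholdMb) {
        return perMonitorResults.stream()
                .flatMap(List::stream)
                .filter(r -> r.memoryMb > thresholdMb)
                .count();
    }
}
```

The `TreeMap` keeps the groups sorted by key, which covers the ORDER BY part. Where possible, letting each monitor pre-aggregate (for example, return a sum and a count per host) would reduce the amount of data sent to the Mgr.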
2) A request from the UI (assume select count(*) from DB where Memory >
1000MB) has to be forwarded to multiple machines. How do I send
parallel requests to the individual monitors and consume the results only when all the
nodes have responded? That is, how do I make the user thread wait until
all the responses from the perf monitors have been consumed? How do I trigger parallel REST requests
for a single UI request on the Mgr?
Ideally, NodeJS-style applications are best suited to this case, where the application gets a callback whenever an HTTP call returns a response. However, you can implement the Observer pattern as explained here: How do I perform a JAVA callback between classes?
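An alternative to a hand-written observer, not mentioned above, is the JDK's own `CompletableFuture`. The sketch below is a minimal example under stated assumptions: `callMonitor` is a placeholder for the real REST call to one perf monitor, and the class and method names are invented for illustration.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.Function;
import java.util.stream.Collectors;

class ParallelCollector {
    // Sends the same query to every monitor in parallel and blocks the calling
    // thread until all of them have answered.
    // `callMonitor` is a placeholder for the real REST call to one perf monitor.
    static <R> List<R> queryAll(List<String> monitorUrls,
                                Function<String, R> callMonitor,
                                ExecutorService pool) {
        // Start all calls first so that they run concurrently on the pool.
        List<CompletableFuture<R>> futures = monitorUrls.stream()
                .map(url -> CompletableFuture.supplyAsync(() -> callMonitor.apply(url), pool))
                .collect(Collectors.toList());
        // join() waits for each response; results come back in monitor order.
        return futures.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
    }
}
```

If any call fails, `join()` rethrows the failure wrapped in a `CompletionException`, so the caller can decide whether to fail the whole UI request or continue with partial results. A real implementation would probably also want a timeout per monitor.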
3) Do I have to authenticate the UI user at both the Mgr and the perf monitors?
It depends on your requirements.
4) Do you see any drawbacks in this approach?
There are several drawbacks with this approach:
Data should not be pulled on demand from the UI. At the very least, the data should already be available in a centralised database when a request arrives. Pulling data from various endpoints is expensive.
Stats must be collected periodically to maintain history, and reports must be generated based on a moving time window.
The JVM might run out of memory if large data needs to be processed. Proper handling is required.
Large amounts of data might be transferred over the network every time there is a new request, possibly for the same data again.
Notes:
1) I didn't go for NoSQL because the data is structured and no joins are
required.
NoSQL doesn't mean that no structure is followed. A NoSQL database is even the best fit for data like this, where records are not updated and transactions etc. are not required.
2) I didn't go for node.js since I am new to it and developing with it may take more
time. Also, I am not developing anything concurrency-critical
where a single-threaded model is best suited. Here data is only
pushed and retrieved; no modification happens.
NodeJS won't be a good choice since it is single-threaded. NodeJS should not be used when you have a CPU-intensive job to perform, like yours.
3) I want an individual DB for each monitor, or at least two DB
instances with multiple clusters per instance, to support faster
access to real-time big statistical data.
I would rather suggest that you store the data in a database that can scale horizontally, and process the data either as it arrives or in batches, so that your user experience is good.

Optimizing async multi-request operations calling the same service

We are developing a document management web application, and right now we are thinking about how to handle actions on multiple documents. For example, let's say a user multi-selects 100 documents and wants to delete all of them. Until now (when we did not support multiple selection), the deleteDoc action made an Ajax request to a deleteDocument service with the docId. The service in turn calls the corresponding utility function, which does the required permission checking and proceeds to delete the document from the database. When it comes to multiple deletion, we are not sure of the best way to proceed. We have come up with several solutions but do not know which one is best (practice), and I'm looking for advice. Mind you, we are keen on keeping the back-end code as intact as possible:
1) Create a new multipleDeleteDocument service which calls the single-document delete utility function once for each document we want to delete (ugly in my opinion and counter-intuitive with modern practices).
2) Keep the back-end code as is and instead make an Ajax request to the service for every document.
3) Somehow (I have no idea if this is even possible) batch the requests into one but still have the server execute the deleteDocument service X times.
4) Use WebSockets for the multi-delete action, essentially cutting down on the communication overhead and time. Our application generally runs over LAN networks with low latency, which is optimal for WebSockets (when latency is introduced, WebSockets tend to match HTTP request speeds).
5) Something we haven't thought of?

Sending N Ajax calls or N webSocket messages when all the data could be combined into a single call or message is never optimal, so options 2 and 4 are certainly not ideal. I see no particular reason to use a webSocket over an Ajax call. If you already have a webSocket connection, then you can certainly just send a single delete message with a list of document IDs over the webSocket, but an Ajax call would work just as well, so I wouldn't create a webSocket connection just for this purpose.
Options 1 and 3 both require a new service endpoint that lets you make a single call to delete multiple documents. This would be recommended.
If I were designing an API like this, I'd design a single delete endpoint that takes one or more document IDs. That way the same API call can be used whether deleting a single document or multiple documents.
Then, from the client anytime you have multiple documents to delete, always collect them together and make one API call to delete all of them at once.
Internal to the server, how you implement that API depends upon your data store. If your data store also permits sending multiple documents to delete, then you would likewise call the data store that way. If it only supports single deletes, then you would just loop and delete each one individually.
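A minimal server-side sketch of such an endpoint, assuming the data store only supports single deletes. The `deleteOne` predicate is a stand-in for the existing single-document utility (including its permission check), and the class and method names are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

class DocumentDeletionService {
    // Stand-in for the existing utility that checks permissions and deletes
    // one document; returns true if the document was deleted.
    private final Predicate<String> deleteOne;

    DocumentDeletionService(Predicate<String> deleteOne) {
        this.deleteOne = deleteOne;
    }

    // One entry point for one or many IDs. Returns a per-document result so
    // the client can report which deletions failed.
    Map<String, Boolean> deleteDocuments(List<String> docIds) {
        Map<String, Boolean> results = new LinkedHashMap<>();
        for (String id : docIds) {
            results.put(id, deleteOne.test(id));
        }
        return results;
    }
}
```

Returning a result per ID is a design choice rather than a requirement: with 100 documents, some deletions may fail the permission check while others succeed, and the client needs to know which.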

Option 3 would be the most elegant solution for me.
Assuming you send requests like POST /deleteDocument with docId as a parameter, you could instead pass an array of document IDs to remove.
Then in the backend you would only have to iterate through the list of IDs and perform the deletion. You should be able to keep the deletion code relatively intact.

Google Cloud Endpoints with Android - Caching response?

I am using Google Cloud Endpoints (Java) as a backend for an Android app. The data on the backend is stored to Datastore.
The problem is that the Android client often doesn't receive correct data from the backend. After repeating the request, the correct data is eventually received.
Can you tell me what could cause this problem? Is there some kind of response caching in the Cloud Endpoints library, or caching within Datastore?
I have read about the eventual consistency provided by Datastore, so I am trying to wait at least 10 seconds after updating the entities before the next query request. Is that enough?
Thanks a lot.

It sounds as though the way you structured your data model is not compliant with the way in which you want to use the data model. Without seeing an example of your data model I can't give you hints as to how to change it but here is a simple example that might help you.
Suppose I have a "Post" entity and a "CommentForPost" entity. Users can create many CommentForPost entries for each Post entry. Logically, the CommentForPost is a child of Post but if you don't make the Post an ancestor of all the CommentForPost entries then immediate consistency is not guaranteed. If user A creates a comment and then queries for it they may not see the entry right away.
On the other hand, if you make the Post entity the parent of all of the CommentForPost entries then you will guarantee immediate consistency because they are being saved as a single EntityGroup. If user A creates an entry and immediately queries for it (using the Post key as an ancestor) then they are for sure going to get accurate data back.
The limitation here is that you would only be able to create one CommentForPost entry per second. This is where the give and take lies. If you need a lot of updates within short periods of time then you can't guarantee consistency (e.g. 100 users are all adding CommentForPost entries at the same time). If you want to guarantee consistency then you can't have a lot of updates within a short period of time.
Make sense?

Straight from the docs at Structuring Data for Strong Consistency: If you always require the result of a Datastore query to be consistent, you will need to use an ancestor query as per the example code in the article:
Query query = new Query("Greeting", guestbookKey).setAncestor(guestbookKey);
You can't rely on a delay to obtain consistency as it depends on replication across data centers and the amount of time this takes is never, well, consistent :)

JMS message. Model to include data or pointers to data?

I am trying to resolve a design difference of opinion where neither of us has experience with JMS.
We want to use JMS to communicate between a j2ee application and the stand-alone application when a new event occurs. We would be using a single point-to-point queue. Both sides are Java-based. The question is whether to send the event data itself in the JMS message body or to send a pointer to the data so that the stand-alone program can retrieve it. Details below.
I have a j2ee application that supports data entry of new and updated persons and related events. The person records and associated events are written to an Oracle database. There are also stand-alone, separate programs that contribute new person and event records to the database. When a new event occurs through any of 5-10 different application functions, I need to notify remote systems through an outbound interface using an industry-specific standard messaging protocol. The outbound interface has been designed as a stand-alone application to support scalability through asynchronous operation and by moving it to a separate server.
The j2ee application currently has most of the data in memory at the time the event is entered. The data would consist of approximately 6 different objects; a person object and some with multiple instances for an average size in the range of 3000 to 20,000 bytes. Some special cases could be many times this amount.
From a performance and reliability perspective, should I model the JMS message to pass all the data needed to create the interface message, or model the JMS message to contain record keys for the data and have the stand-alone Java application retrieve the data to create the interface message?

I wouldn't just focus on performance for the decision, but also on other non-functional considerations.
I've been working on a system where we decided to not send the data in the message, but rather the PK of the data in database. Our approach was closer to the command message pattern. Our choice was motivated by the following reasons:
Data size: we would store the data in a BLOB because it could be huge. In your case, the data probably fits in a message anyway.
Message loss: we planned for the worst. If messages were lost, we could recover the data, and we had a recovery procedure to resubmit the messages. It may look paranoid, but here are two scenarios that could lead to messages being lost: (1) the queue is purged by mistake; (2) an error occurs and messages can't be delivered for a long time. They go to the dead message queue (DMQ), which eventually reaches its limit and starts discarding messages if not configured correctly.
Monitoring: different messages/commands could update the same row in the database. That was easy to monitor and troubleshoot.
Using JMS plus a database did, however, complicate the design a bit:
Distributed transactions: this adds some complexity, and sometimes some problems. Distributed transactions have subtle differences from "regular" transactions, such as distributed timeouts.
Persistence: the code is less intuitive. Data must first be persisted to obtain the PK, which leads to some complexity in the code if an ORM is used.
I guess both approaches can work. I've described what led us to not send the data in the message, but your system and requirements might be different, so it might still be easier to send the data in the message in your case. I can not provide a definitive answer, but I hope it helps you make your decision.

Send the data, not the pointer. I wouldn't consider your messages to be an extraordinary size that can't be handled.

It will be no problem for the queue to handle the data; the messages in the queue are persisted anyway (memory, file or database persistence, whatever fits the size of your queue best).
If you just put a handle to the data in the queue, the application that processes the queue will do unnecessary work to fetch data that the sender already has.

From your question alone I cannot say what's best in your case. Sure, there are performance implications because of the message size and so on, but first you need to know which information your message consumer has to send to the remote system, especially in a system which may have concurrent updates on the same data.
It is relevant whether you need to keep the information stored in the remote system in sync with the version of the record just stored in your database, and whether you want to propagate a complete history to the remote system that is updated by the message receiver, because a lot of time may pass between sending the message and processing it on the other end of the queue.
Assume (for some reason) there are a whole lot of messages in the queue, and within a few seconds or minutes three or four update notifications for the same object hit the queue. Assume the first message is processed only after the fourth update to the record has finished and its update notification has been put in the queue. When you only pass along the ID of the record, all four messages would perform exactly the same operation on the remote system, which for one thing is completely superfluous. In addition, the remote system sees four updates, all the same, but has no information about the three intermediate states of the object; thus the history, if relevant, is lost for this system.
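The scenario above can be simulated with plain collections. In this sketch a `HashMap` stands in for the database table and an `ArrayList` for the JMS queue; the record key `person-1` and all class and method names are hypothetical, and no real JMS API is involved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MessagePayloadDemo {
    // Returns what the consumer would forward to the remote system for each
    // queued message, given that every update is committed and enqueued
    // before the consumer starts processing.
    static List<String> consume(List<String> updates, boolean sendFullData) {
        Map<String, String> database = new HashMap<>(); // stand-in for the DB table
        List<String> queue = new ArrayList<>();          // stand-in for the JMS queue
        String recordId = "person-1";                    // hypothetical record key

        // Producer side: store the new state, then enqueue data or just the ID.
        for (String state : updates) {
            database.put(recordId, state);
            queue.add(sendFullData ? state : recordId);
        }

        // Consumer side: use the message body, or look up the current row by ID.
        List<String> forwarded = new ArrayList<>();
        for (String message : queue) {
            forwarded.add(sendFullData ? message : database.get(message));
        }
        return forwarded;
    }
}
```

With four updates `v1` to `v4`, the ID-only variant forwards the final state four times, while the full-data variant forwards each intermediate state once, which is the difference in history described above.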
Besides these semantic implications, the technical question when choosing between passing the ID and passing the whole data is whether it's cheaper to unwrap the updated information from the message body or to load it from the database. This depends on how you want to serialize/deserialize the contents. The message sizes you provided should be no problem for a decent JMS implementation if you want to send the data along.
When serializing Java objects into messages you need to keep the class format in sync between sender and consumer, and you have to empty the queue before you can update to a newer version of the class on the consuming side. Of course the same goes for database updates when you just pass along the ID.
When you just send the ID to the consumer you will have additional database connections. This might also be relevant, depending on the load on the database and how complex the queries are that you need to execute to get the objects.

Best performing way to guarantee data consistency between concurrent web service calls?

Multiple clients are concurrently accessing a JAX-WS web service running on Glassfish or some other application server. Persistence is provided by something like Hibernate or OpenJPA. The database is Microsoft SQL Server 2005.
The service takes a few input parameters, some "magic" occurs, and then returns what is basically a transformed version of the next available value in a sequence, with the particular sequence and transformation being determined by the inputs. The "magic" that performs the transformation depends on the input parameters and various database tables (describing the relationship between the input parameters, the transformation, the sequence to get the next base value from, and the list of already served values for a particular sequence). Not sure if this could all be wrapped up in a stored procedure (probably), but also not sure if the client wants it there.
What is the best way to ensure consistency (i.e. each value is unique and values are consumed in order, with no opportunity for a value to reach a client without also being stored in the database) while maintaining performance?

It's hard to provide a complete answer without a full description (table schemas, etc.), but giving my best guess here as to how it works, I would say that you need a transaction around your "magic", which marks the next value in the sequence as in use before returning it. If you want to reuse sequence numbers then you can later unflag them (for example, if the user then cancels what they're doing) or you can just consider them lost.
One warning is that you want your transaction to be as short and as fast as possible, especially if this is a high-throughput system. Otherwise your sequence tables could quickly become a bottleneck. Analyze the process and see what the shortest transaction window is that will still allow you to ensure that a sequence isn't reused and use that.
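To illustrate the "mark as used before returning" idea in isolation, here is a minimal in-memory sketch. The `synchronized` keyword only plays the role of the short database transaction, and the list stands in for the table of already served values; a real implementation would do both steps inside one transaction in SQL Server. All names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class SequenceAllocator {
    private long next = 1;
    // Stand-in for the database table of already served values.
    private final List<Long> served = new ArrayList<>();

    // `synchronized` plays the role of the short database transaction:
    // the value is recorded as served before it is handed to the caller,
    // so no value can reach a client without also being stored.
    synchronized long allocate() {
        long value = next++;
        served.add(value);
        return value;
    }

    // Returns a copy of the served values in the order they were handed out.
    synchronized List<Long> servedValues() {
        return new ArrayList<>(served);
    }
}
```

The critical section contains only the increment and the record of the served value. The transformation "magic" can happen outside it, which keeps the equivalent transaction window as short as the warning above recommends.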

It sounds like you have most of the elements you need here. One thing that might pose difficulty, depending on how you've implemented your service, is that you don't want to write any response to the browser until your database transaction has been safely committed without errors.
A lot of web frameworks keep the persistence session open (and uncommitted) until the response has been rendered to support lazy loading of persistent objects by the view. If that's true in your case, you'll need to make sure that none of that rendered view is delivered to the client until you're sure it's committed.
One approach is a Servlet Filter that buffers output from the servlet or web service framework that you're using until it's completed its work.
