GAE/J Shard counter calculation for live product ratings

I have an Android client that deals with product items, and I would like to create an interface for displaying the most popular products at any given time.
I have read about and used sharded counters to achieve highly scalable, parallel counting. This has been working well as far as counting is concerned.
However, the problem starts when it comes time to calculate the top 10 most popular products: for a single request I have to fetch all the product entities first, fetch the shard counters of each, add them up, and finally sort them to get the most popular ones.
The problem here is that in order to find out what's most popular I have to recalculate all the shard counters. Multiply that by 10,000 product items and my request for a single user becomes slow as hell.
I've considered using a cron job to calculate the result and store that instead. Would you recommend going that way? Has anyone else dealt with a similar situation?
Thanks!

Either regularly aggregate the counters into a single read-only value, as you suggest, or use an alternate way to keep high-concurrency counters, like this.
If you go with the former approach, you probably want to use a mapreduce triggered from a cronjob.
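For concreteness, here is a minimal sketch of the cron-triggered aggregation in Java (the kinds 'CounterShard' and 'ProductTotal' and the properties 'productId', 'count', and 'total' are illustrative names, not anything from the original setup):

import com.google.appengine.api.datastore.*;
import java.util.HashMap;
import java.util.Map;

// Invoked from cron: sum each product's counter shards into one
// read-only total so the "top 10" request becomes a single indexed query.
public class CounterAggregator {
  public static void aggregate() {
    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

    // Walk all shards and sum per product. At 10,000 products you would
    // batch this work (or use the mapreduce suggested above).
    Map<String, Long> totals = new HashMap<>();
    for (Entity shard : ds.prepare(new Query("CounterShard")).asIterable()) {
      String productId = (String) shard.getProperty("productId");
      Long count = (Long) shard.getProperty("count");
      totals.merge(productId, count, Long::sum);
    }

    // One read-only summary entity per product, keyed by product id.
    for (Map.Entry<String, Long> e : totals.entrySet()) {
      Entity total = new Entity("ProductTotal", e.getKey());
      total.setProperty("total", e.getValue());
      ds.put(total);
    }
  }
}

The per-request top 10 then reduces to one query over ProductTotal, sorted descending on the indexed 'total' property with a limit of 10, with no summing at read time.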

Related

GAE Long IDs too long: is there a way to get shorter Long IDs?

During localhost development, the IDs generated by GAE start with 1.
However, in a real GAE deployment in the cloud, the IDs generated even for the first entities are quite long, like 5639412304721232. Is there a workaround to make the first entities start with 1, 2, 3, and so on?
One might suggest using sharded counters, and yes, I've used those, but some suggest that sharded counters should not be used for IDs, as the app might read the same count twice since they are eventually consistent.
In this case, what could be the best solution?
The official post explaining the switch from sequential to 'scattered' ids is here.
The instructions for reverting to sequential behaviour are here, but note the warning that this option will eventually be removed.
The 'best' solution depends on what you need and why. You'll get better datastore performance with scattered ids, but honestly, you might not notice much difference if your app gets a small number of requests and makes light use of the datastore. If that's the case, you can roll your own sequential ids based on a simple entity with a property that holds the current high-watermark id, and rely on a low transaction rate to keep you from running into limits on the number of transactions per entity.
Reliably handing out sequential ids without gaps in a distributed system is challenging.
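A minimal sketch of that high-watermark approach with the low-level Java API (the 'Sequence' kind and 'highWatermark' property are invented for the example; the transaction is what prevents two concurrent requests from being handed the same id):

import com.google.appengine.api.datastore.*;

// Hands out sequential ids by transactionally incrementing a single
// high-watermark entity. Only viable at a low transaction rate, since
// every allocation contends on the same entity group.
public class SequenceAllocator {
  public static long nextId(String sequenceName) {
    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
    Transaction txn = ds.beginTransaction();
    try {
      Key key = KeyFactory.createKey("Sequence", sequenceName);
      Entity seq;
      try {
        seq = ds.get(txn, key);
      } catch (EntityNotFoundException e) {
        seq = new Entity(key); // first allocation for this sequence
        seq.setProperty("highWatermark", 0L);
      }
      long next = (Long) seq.getProperty("highWatermark") + 1;
      seq.setProperty("highWatermark", next);
      ds.put(txn, seq);
      txn.commit();
      return next;
    } finally {
      if (txn.isActive()) {
        txn.rollback(); // a contended commit throws; roll back cleanly
      }
    }
  }
}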
Be aware that you may run into problems if you create a lot of entities very quickly, with sequential Long IDs. This post gives you an explanation why.
In theory there's a choice of auto ID generation policies, with scattered IDs being the default since 1.8.1, but the old monotonically increasing legacy policy is to be deprecated for the reasons discussed in the linked post.
If you're using a sharded counter, you will avoid this but, as you say, you may encounter other issues.
You might try using allocate_ids. We use this to get smaller integer values for system-generated ids. In Python, using a db kind:
from google.appengine.ext import db

# allocate_ids reserves a batch of ids for the given kind; it returns an
# inclusive (start, end) range, so key_batch[0] is the first reserved id.
model_key = db.Key.from_path('your_kind_name', 1)
key_batch = db.allocate_ids(model_key, 1)
id_new = key_batch[0]
idkey = db.Key.from_path('your_kind_name', id_new)
I would assign the key's identifier as the strings "1", "2", "3"... and so on, generating them from a sequencer. You can check to see if the entity already exists with a get_or_insert() function.
Similarly, you can use the auto-increment solution by storing the sequence number in an entity.

MongoDB related scaling issue

Just FYI, this question is not exactly about MongoDB, but it happens to use MongoDB. I am assuming we might end up using MongoDB features such as sharding in a good design, hence the mention of MongoDB. Also, fwiw, we use Java.
So we have around 100 million records in a certain collection, of which we need to select all the items which have some date set to tomorrow. Usually this query returns 10 million records.
You can assume that we have N (say ten) machines at hand, and that MongoDB is sharded on record_id.
Each record that we will process is independent of the other records we are reading. No records will be written as part of this batch job.
What I am looking to do is:
1. No centralized workload distribution across the different machines.
2. Fair, or almost fair, workload distribution.
(not sure if the following requirement can be fulfilled without compromising requirement 1)
3. Fault tolerance (if one of the batch machines is down, we want another machine to take over its load).
Any good solution which has already worked in a similar situation?
I can speak in the context of MongoDB.
Requirements 1 and 2 are handled by sharding. I'm not sure I follow your question, though, as it sounds like 1 says you don't want to centralize the workload and 2 says you want to distribute it evenly.
In any case, with the proper shard key, you will distribute your workload across your shards. http://docs.mongodb.org/manual/sharding/
Requirement 3 is performed via replica sets in MongoDB. http://docs.mongodb.org/manual/replication/
I would have to understand your application and use case more to know for certain, but pulling 10M of 100M records as your typical access pattern doesn't sound like the right document model is in place. Keep in mind that collection <> table and document <> record. I would look into storing your records at a higher logical granularity so you pull fewer documents; this will significantly improve performance.
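To illustrate the coordinator-free split in Java (a sketch only; the field names record_id and dueDate, the connection string, and the modulo scheme are all assumptions): each of the N machines runs the same job with its own index and claims the records where record_id % N equals that index, so the load splits roughly evenly with no central dispatcher.

import com.mongodb.client.*;
import org.bson.Document;
import static com.mongodb.client.model.Filters.*;

// Each worker runs this with a distinct workerIndex in [0, numWorkers).
// The $mod filter gives every worker a disjoint slice of the records.
public class BatchWorker {
  public static void process(int workerIndex, int numWorkers, String tomorrow) {
    try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
      MongoCollection<Document> coll =
          client.getDatabase("mydb").getCollection("records");
      FindIterable<Document> slice = coll.find(and(
          eq("dueDate", tomorrow),
          mod("record_id", numWorkers, workerIndex)));
      for (Document record : slice) {
        handle(record); // records are independent, so no coordination needed
      }
    }
  }

  private static void handle(Document record) { /* per-record work */ }
}

One caveat: $mod cannot be answered from an index, so at this scale, splitting on precomputed record_id ranges would be cheaper; the modulo version is just the simplest coordinator-free scheme. Fault tolerance (requirement 3) still needs something extra, such as surviving workers re-running a failed worker's slice.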

google datastore aggregate query

I have been reading a lot about ways to do aggregate queries on the datastore (through Stack Overflow and elsewhere). The preponderance of answers is that it cannot be done in a pleasant way, but those answers are dated, and the same people tend to also claim that you cannot do things such as ORDER BY on the datastore.
As it exists today, you actually can specify ORDER BY on the datastore. So I am wondering if aggregation is also possible.
Consider the scenario where I have five candidates (Alpha, Bravo, Charlie, Delta, and Echo) and 10,000 voters. I want to retrieve the candidates and the number of votes each received, in order. How would I do that on the datastore? I am using Java.
Also, as an aside: if the answer is still no and fanning-in is my best option, is fan-in thread-safe? By fanning-in I mean keeping an explicit counter of the votes each candidate receives (in a separate table). Could I experience a race condition or some other data fault when multiple users are voting concurrently?
If by aggregating you mean having the datastore compute the total # of votes for you, then no, the datastore won't do that.
The best way to do what you're describing is:
Create a set of sharded counters per candidate (google search for app engine sharded counters).
When someone votes, update the sharded counter for the given candidate.
When you want to read the votes, query for your candidates, then for each candidate, query for its sharded counters and sum them up.
Use memcache for better performance; the GAE sharded counters example available in the docs shows this pretty well.
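A sketch of that read path in Java (kind and property names such as 'VoteShard', 'candidate', and 'count' are placeholders; memcache serves the summed total so most reads never touch the shards):

import com.google.appengine.api.datastore.*;
import com.google.appengine.api.memcache.*;

// Sums a candidate's counter shards, caching the result so repeated
// reads hit memcache instead of re-querying every shard.
public class VoteCounter {
  public static long votesFor(String candidate) {
    MemcacheService cache = MemcacheServiceFactory.getMemcacheService();
    Long cached = (Long) cache.get("votes:" + candidate);
    if (cached != null) {
      return cached;
    }
    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
    Query q = new Query("VoteShard")
        .setFilter(new Query.FilterPredicate(
            "candidate", Query.FilterOperator.EQUAL, candidate));
    long total = 0;
    for (Entity shard : ds.prepare(q).asIterable()) {
      total += (Long) shard.getProperty("count");
    }
    // A short expiry keeps displayed totals reasonably fresh.
    cache.put("votes:" + candidate, total, Expiration.byDeltaSeconds(60));
    return total;
  }
}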
Datastore aggregation queries were recently launched and are available for use now: https://cloud.google.com/datastore/docs/aggregation-queries.
There are also various client libraries which support this feature.
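With the google-cloud-datastore Java client, a COUNT aggregation looks roughly like this (a sketch; the 'Vote' kind and 'candidate' property are placeholder names):

import com.google.cloud.datastore.*;
import static com.google.cloud.datastore.aggregation.Aggregation.count;

// Server-side COUNT: the datastore computes the total, so the vote
// entities themselves are never fetched.
public class VoteCount {
  public static void main(String[] args) {
    Datastore datastore = DatastoreOptions.getDefaultInstance().getService();
    EntityQuery votesForAlpha = Query.newEntityQueryBuilder()
        .setKind("Vote")
        .setFilter(StructuredQuery.PropertyFilter.eq("candidate", "Alpha"))
        .build();
    AggregationQuery countQuery = Query.newAggregationQueryBuilder()
        .over(votesForAlpha)
        .addAggregation(count().as("total"))
        .build();
    AggregationResult result =
        datastore.runAggregation(countQuery).iterator().next();
    System.out.println("Alpha votes: " + result.get("total"));
  }
}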

Java EE record product impressions

Part of my project requires that we maintain stats for our customers' products. More or less, we want to show our customers how often their products have been viewed on the site.
Therefore we want to create some form of product impressions counter. I do not mean just a counter for when someone lands on the specific product page, but also for when the product appears in search results and in our product directory lists.
I was thinking that after calling the DB I would extract the specific product ids and pass them to a service that would then insert them into the stats tables. Another option is some form of singleton buffer writer which writes to the DB after it reaches a certain size?
Has anyone encountered this in their projects, and does anyone have ideas they would like to share?
And / or does anyone know of any framework or tools that could aid this development?
Any input would be really appreciated.
As long as you don't have performance problems, do not over-engineer your design. On the other hand, depending on how big the site is, it seems you are going to have performance problems due to the huge volume of writes.
I think real-time updates would have a huge performance impact. It is also very likely that you would update the same data multiple times within a short period. Another thing is that, although interesting, these statistics are not mission-critical and recording them shouldn't affect normal system operation. Final thought: inconsistencies and minor inaccuracy are IMHO acceptable in this use case.
Taking all this into account, I would temporarily hold the statistics in memory and flush them periodically, as you've suggested. This has the additional benefit of merging events for the same product: if between two flushes some product was viewed 10 times, you perform only one update, not 10.
Technically, you can use a properly synchronized singleton with a background thread (a lot of handcrafting) or an intelligent cache with write-behind support.
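A minimal sketch of the buffered approach (class names and the one-minute flush interval are illustrative; a caching library with write-behind would replace most of this):

import java.util.Map;
import java.util.concurrent.*;
import java.util.concurrent.atomic.LongAdder;

// Buffers impression counts in memory and flushes periodically, merging
// repeated views of the same product into a single database update.
public class ImpressionBuffer {
  private final ConcurrentHashMap<Long, LongAdder> counts = new ConcurrentHashMap<>();
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  public ImpressionBuffer() {
    scheduler.scheduleAtFixedRate(this::flush, 1, 1, TimeUnit.MINUTES);
  }

  // Call wherever a product is shown: product page, search results, lists.
  public void record(long productId) {
    counts.computeIfAbsent(productId, id -> new LongAdder()).increment();
  }

  private void flush() {
    for (Map.Entry<Long, LongAdder> e : counts.entrySet()) {
      long delta = e.getValue().sumThenReset();
      if (delta > 0) {
        // One UPDATE per product per interval, e.g.
        // UPDATE product_stats SET impressions = impressions + ? WHERE product_id = ?
        persist(e.getKey(), delta);
      }
    }
  }

  private void persist(long productId, long delta) { /* DAO / JDBC call */ }
}

Counts accumulated since the last flush are lost if the JVM dies, which matches the premise above that minor inaccuracy is acceptable here.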

What is the best way to handle one to many relationships in the low level datastore api?

I've been using the low level datastore API for App Engine in Java for a while now and I'm trying to figure out the best way to handle one to many relationships. Imagine a one to many relationship like "Any one student can have zero or more computers, but every computer is owned by exactly one student".
The two options are to:
have the student entity store a list of Keys of the computers associated with the student
have the computer entity store a single Key of the student who owns the computer
I have a feeling option two is better but I am curious what other people think.
The advantage of option one is that you can get all the 'manys' back without using a query: you ask the datastore for the entities with a batch get(), passing in the stored list of keys. The problem with this approach is that the datastore does no sorting of the values returned from get(); you must do the sorting yourself. Plus, you have to manage a list rather than a single Key.
Option two seems nice because there is no list to maintain. Also, you can sort by properties of the computer as long as there is an index for that property. Imagine trying to get all the computers for a student with the results sorted by purchase date: with approach two it is a simple query, and no sorting is done in our code (the datastore's index takes care of it).
Sorting is not really hard, but it is a little more time-consuming (O(n log n) for a sort versus O(n) to walk a sorted index). The tradeoff is an index (space in the datastore) for processing time. As I said, my instinct tells me option two is the better general solution, because it gives the developer a little more flexibility in getting results back in order, at the cost of additional indexes (which, under Google's pricing model, are pretty cheap). Does anyone agree, disagree, or have comments?
Both approaches are valid in different situations, though option two - storing a single reference on the 'many' side - is the more common approach. Which you use depends on how you need to access your data.
Have you considered doing both? Then you could quickly get a list of computers a student owns by key OR use a query which returns results in some sorted order. I don't think maintaining a list of keys on the student model is as intimidating as you think.
Don't underestimate the benefit of fetching entities directly by keys. According to this article, this can be 4-5x faster than queries.
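A sketch of option two with the low-level API (the kind and property names are invented for the example): each Computer stores its owner's Key, so listing a student's computers sorted by purchase date is a single indexed query, while option one's strength, a direct batch get(keys), remains available whenever you do hold the keys.

import com.google.appengine.api.datastore.*;
import java.util.Date;
import java.util.List;

// Option two: the Computer entity stores the owning student's Key,
// so a sorted listing is one query against a single-property index.
public class ComputerDao {
  private final DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

  public Entity addComputer(Key studentKey, Date purchaseDate) {
    Entity computer = new Entity("Computer");
    computer.setProperty("owner", studentKey);
    computer.setProperty("purchaseDate", purchaseDate);
    ds.put(computer);
    return computer;
  }

  // All of a student's computers, newest purchase first. The datastore
  // index does the sorting; no sorting happens in application code.
  public List<Entity> computersFor(Key studentKey) {
    Query q = new Query("Computer")
        .setFilter(new Query.FilterPredicate(
            "owner", Query.FilterOperator.EQUAL, studentKey))
        .addSort("purchaseDate", Query.SortDirection.DESCENDING);
    return ds.prepare(q).asList(FetchOptions.Builder.withDefaults());
  }
}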
