I have been reading a lot about ways to do aggregate queries on the datastore (through Stack Overflow and elsewhere). The preponderance of answers is that it cannot be done in a pleasant way. But those answers are dated, and the same people also tend to claim that you cannot do things such as ORDER BY on the datastore.
As it exists today, you actually can specify ORDER BY on the datastore. So I am wondering if aggregation is also possible.
Consider the scenario where I have five candidates (Alpha, Bravo, Charlie, Delta and Echo) and 10,000 voters. I want to retrieve the candidates and the number of votes each received, in order. How would I do that on the datastore? I am using Java.
Also, as an aside, if the answer is still no and fanning-in is my best option: is fan-in thread safe? By fanning-in I mean keeping an explicit counter that counts the vote each candidate receives (in a separate table). Could I experience a race condition or some other faults in the data when multiple users are voting concurrently?
If by aggregating you mean having the datastore compute the total # of votes for you, then no, the datastore won't do that.
The best way to do what you're describing is:
Create a set of sharded counters per candidate (google search for app engine sharded counters).
When someone votes, update the sharded counter for the given candidate.
When you want to read the votes, query for your candidates, then for each candidate, query for the sharded counters and sum them up.
Use memcache for better performance; the GAE sharded counters example in the docs shows this pretty well.
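The steps above can be sketched in plain Java. This is only an in-memory illustration of the sharding idea, not datastore code: the class names `ShardedVoteCounter` and `VoteTally` are hypothetical, and each array slot stands in for one counter-shard entity that a real app would read and write through the datastore API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLongArray;

// In-memory stand-in for a sharded counter: each slot plays the role of
// one counter-shard entity (an assumption made for illustration).
class ShardedVoteCounter {
    private final AtomicLongArray shards;

    ShardedVoteCounter(int numShards) {
        shards = new AtomicLongArray(numShards);
    }

    // A vote updates one randomly chosen shard, so concurrent voters
    // rarely contend on the same slot.
    void increment() {
        int i = ThreadLocalRandom.current().nextInt(shards.length());
        shards.incrementAndGet(i);
    }

    // Reading the total means visiting every shard and summing.
    long total() {
        long sum = 0;
        for (int i = 0; i < shards.length(); i++) {
            sum += shards.get(i);
        }
        return sum;
    }
}

class VoteTally {
    // Returns candidate names ordered by vote count, highest first.
    static List<String> ranking(Map<String, ShardedVoteCounter> counters) {
        List<String> names = new ArrayList<>(counters.keySet());
        names.sort(Comparator.comparingLong(
                (String n) -> counters.get(n).total()).reversed());
        return names;
    }

    public static void main(String[] args) {
        Map<String, ShardedVoteCounter> counters = new LinkedHashMap<>();
        for (String name : new String[] {"Alpha", "Delta", "Echo"}) {
            counters.put(name, new ShardedVoteCounter(20));
        }
        for (int i = 0; i < 3; i++) {
            counters.get("Delta").increment();
        }
        counters.get("Alpha").increment();
        for (String name : ranking(counters)) {
            System.out.println(name + " " + counters.get(name).total());
        }
    }
}
```

Each increment here is atomic on its own slot, which is the property the datastore version gets by updating a single shard inside a transaction.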
It's recently launched and available for use now: https://cloud.google.com/datastore/docs/aggregation-queries.
Various client libraries also support this feature.
Related
During localhost development, the IDs generated by GAE start at 1.
However, in a real GAE deployment in the cloud, the IDs generated even for the first entities are quite long, like 5639412304721232. Is there a workaround to make the first entities start with 1, 2, 3 and so on?
One might suggest using sharded counters, and yes, I've used them. However, some suggest that sharded counters should not be used for this, as the app might get the same count twice because it is eventually consistent.
In this case what could be the best solution?
The official post explaining the switch from sequential to 'scattered' ids is here.
The instructions for reverting to sequential behaviour are here, but note the warning that this option will eventually be removed.
The 'best' solution depends on what you need and why. You'll get better datastore performance with scattered ids, but honestly, you might not notice much difference if your app gets a small number of requests and makes light use of the datastore. If that's the case, you can roll your own sequential ids based on a simple entity with a property that holds the current high watermark id, and rely on having a low transaction rate to keep you from running into limits on the number of transactions per entity.
Reliably handing out sequential ids without gaps in a distributed system is challenging.
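A minimal sketch of the high-watermark idea, in plain Java. The class name `SequenceEntity` is hypothetical, and the lock only stands in for the datastore transaction that would wrap the read-increment-write on the real entity.

```java
import java.util.concurrent.locks.ReentrantLock;

// Stand-in for a single entity whose one property holds the highest ID
// handed out so far (the "high watermark").
class SequenceEntity {
    // Plays the role of the transaction around the entity update.
    private final ReentrantLock txn = new ReentrantLock();
    private long highWatermark;

    SequenceEntity(long start) {
        this.highWatermark = start;
    }

    // Read, increment and write back as one atomic step, so two callers
    // never receive the same ID.
    long nextId() {
        txn.lock();
        try {
            highWatermark += 1;
            return highWatermark;
        } finally {
            txn.unlock();
        }
    }

    public static void main(String[] args) {
        SequenceEntity seq = new SequenceEntity(0);
        System.out.println(seq.nextId()); // 1
        System.out.println(seq.nextId()); // 2
        System.out.println(seq.nextId()); // 3
    }
}
```

Every caller serializes on the same entity, which is why this only suits a low transaction rate.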
Be aware that you may run into problems if you create a lot of entities very quickly, with sequential Long IDs. This post gives you an explanation why.
In theory there's a choice of auto ID generation policies, with scattered IDs being the default since 1.8.1, but the old monotonically increasing legacy policy is to be deprecated for the reasons discussed in the linked post.
If you're using a sharded counter, you will avoid this but, as you say, you may encounter other issues.
You might try using allocate_ids. We use this to get smaller integer values for system-generated ids. In Python using a db kind:
from google.appengine.ext import db
# allocate_ids reserves a range of ids and returns it as a (start, end) tuple
model_key = db.Key.from_path('your_kind_name', 1)
key_batch = db.allocate_ids(model_key, 1)
id_new = key_batch[0]
idkey = db.Key.from_path('your_kind_name', id_new)
I would assign the key's identifier as the strings "1", "2", "3"... and so on, generating them from a sequencer. You can check to see if the entity already exists with a get_or_insert() function.
Similarly, you can use the auto-increment solution by storing the sequence number in an entity.
Just FYI, this question is not exactly about MongoDB, but happens to use MongoDB. I am assuming a good design might end up using MongoDB features such as sharding, hence the mention of MongoDB. Also, fwiw, we use Java.
So we have around 100 million records in a certain collection, from which we need to select all the items that have some date set to tomorrow. Usually this query returns 10 million records.
You can think that we have N (say ten) machines at our hand. We can assume, MongoDB is sharded based on record_id.
Each record that we will process is independent of the other records we are reading. No records will be written as part of this batch job.
What I am looking to do is,
Avoid centralizing workload distribution across the different machines.
Fair or almost fair workload distribution.
(Not sure if the following requirement can be fulfilled without compromising requirement 1.)
Fault tolerance (if one of the batch machines is down, we want another machine to take its load).
Is there a good solution that has already worked in a similar situation?
I can speak in context of MongoDB
Requirements 1 and 2 are handled by sharding. I'm not sure I follow your question, though, as it sounds like requirement 1 says you don't want to centralize the workload and requirement 2 says you want to distribute the workload evenly.
In any case, with the proper shard key, you will distribute your workload across your shards. http://docs.mongodb.org/manual/sharding/
Requirement 3 is performed via replica sets in MongoDB. http://docs.mongodb.org/manual/replication/
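Independently of MongoDB, the question's three requirements can be illustrated with a small sketch in which every worker computes record ownership on its own. This is an assumption-laden illustration, not something the answer above prescribes: the class `RecordAssignment` is hypothetical, and it assumes each worker already knows the current list of live workers from some membership mechanism.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class RecordAssignment {
    // Every worker runs this with the same set of live workers, so all of
    // them agree on the owner of a record without asking a coordinator.
    static String ownerOf(String recordId, List<String> liveWorkers) {
        List<String> sorted = new ArrayList<>(liveWorkers);
        Collections.sort(sorted); // make the result independent of list order
        int slot = Math.floorMod(recordId.hashCode(), sorted.size());
        return sorted.get(slot);
    }

    public static void main(String[] args) {
        List<String> workers = new ArrayList<>();
        Collections.addAll(workers, "w1", "w2", "w3");
        System.out.println(ownerOf("record-42", workers));
        // If w2 goes down, its records are reassigned among the rest.
        workers.remove("w2");
        System.out.println(ownerOf("record-42", workers));
    }
}
```

A plain modulo reshuffles many records when the worker list changes; that is acceptable for a read-only batch job like the one described, but a consistent-hashing scheme would move fewer records.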
I would have to understand your application and use case more to know for certain, but pulling 10M records out of a 100M-record collection as your typical access pattern suggests the right document model is not in place. Keep in mind that a collection is not a table and a document is not a record. I would look into storing your 10M records at a higher logical granularity so you pull fewer records; this will significantly improve performance.
I am a Java developer. I want to know the best way to store a huge amount of data in MySQL using Java.
Huge: two hundred thousand talk messages every second.
An index is not needed here
Should I store the messages into the database as soon as the user creates them? Will it be too slow?
1 billion writes / day is about 12k / second. Assuming each message is about 16 bytes, that's about 200 KB / sec. If you don't care about reading, you can easily write this to disk at this rate, maybe one message per line. Your read access pattern is probably going to dictate what you end up needing to do here.
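The estimate can be checked with a few lines of Java. The class name `WriteRateEstimate` is hypothetical; the inputs (1 billion writes per day, 16 bytes per message) are the ones assumed in the answer.

```java
class WriteRateEstimate {
    static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    // Average writes per second for a given daily volume (integer division).
    static long writesPerSecond(long writesPerDay) {
        return writesPerDay / SECONDS_PER_DAY;
    }

    // Average payload bytes per second at the given message size.
    static long bytesPerSecond(long writesPerDay, int bytesPerMessage) {
        return writesPerSecond(writesPerDay) * bytesPerMessage;
    }

    public static void main(String[] args) {
        long perDay = 1_000_000_000L;
        System.out.println(writesPerSecond(perDay) + " writes/sec");    // 11574
        System.out.println(bytesPerSecond(perDay, 16) + " bytes/sec");  // 185184
    }
}
```

So the rounded figures of roughly 12k writes and 200 KB per second hold as averages; peak rates would be higher.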
If you use MySQL, I'd suggest combining multiple messages per row, if possible. Partitioning the table would be helpful to keep the working set in memory, and you'll want to commit a number of records per transaction, maybe 1000 rows. You'll need to do some testing and tuning, and this page will be helpful:
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
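A sketch of committing a fixed number of rows per transaction with plain JDBC. The table `messages` and its column `body` are hypothetical, as is the class name; the batch size of 1000 is only the starting point suggested above and should be tuned by testing.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

class BatchedMessageWriter {
    // Hypothetical table and column; adjust to the real schema.
    private static final String INSERT_SQL =
            "INSERT INTO messages (body) VALUES (?)";

    // Number of commits needed for a message count and batch size.
    static int commitCount(int messages, int batchSize) {
        return (messages + batchSize - 1) / batchSize;
    }

    // Inserts all bodies, committing once per batchSize rows.
    static void insertAll(Connection conn, List<String> bodies, int batchSize)
            throws SQLException {
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            int pending = 0;
            for (String body : bodies) {
                ps.setString(1, body);
                ps.addBatch();
                if (++pending == batchSize) {
                    ps.executeBatch();
                    conn.commit();
                    pending = 0;
                }
            }
            if (pending > 0) { // flush the final partial batch
                ps.executeBatch();
                conn.commit();
            }
        } catch (SQLException e) {
            conn.rollback(); // discard the uncommitted batch
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}
```

With 1000 rows per commit, 2500 messages take three commits instead of 2500, which is where most of the speed-up comes from.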
You should probably also look at Cassandra which is written with heavy write workloads in mind.
My suggestion is also MongoDB, since the NoSQL paradigm fits your needs well.
Below is a flavor of MongoDB in Java -
// Top-level document
BasicDBObject document = new BasicDBObject();
document.put("database", "mkyongDB");
document.put("table", "hosting");
// Nested sub-document stored under the "detail" key
BasicDBObject documentDetail = new BasicDBObject();
documentDetail.put("records", "99");
documentDetail.put("index", "vps_index1");
documentDetail.put("active", "true");
document.put("detail", documentDetail);
// `collection` is assumed to be a DBCollection obtained elsewhere
collection.insert(document);
This tutorial is good for getting started. You can download MongoDB from GitHub.
For optimization of MongoDB, please refer to this post.
Do you absolutely have to use MySQL, or are you open to other DBs as well? MongoDB or CouchDB would be a good fit for this kind of need. Check them out if you are open to other DB options.
If you absolutely have to go with MySQL, we have done something similar: all the related text messages go into a child record as a single JSON document. We append to it every time, and we keep the master in a separate table. So there is one master and one child record at the minimum, and more child records once the messages go beyond a certain number (30 in our scenario). A "load more..." feature queries the second child record, which holds 30 more.
Hope this helps.
FYI, we are migrating to CouchDB for some other reasons and needs.
There are at least 2 different parts to this problem:
Processing the messages for storage in the database
What type of storage to use for the message
For processing the messages, you're likely going to need a horizontally scalable system (meaning you can add more machines to process the messages quickly) so you don't accumulate a huge backlog of messages. You should definitely not try to write these messages synchronously, but rather when a message is received, put it on a queue to be processed for writing to the database (something like JMS comes to mind here).
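The queueing idea can be sketched with the JDK's own concurrent queue. This is a single-process illustration only, with a hypothetical class `MessageBuffer`; a deployment along the lines described above would put a real message broker (JMS or similar) between the receivers and the writers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class MessageBuffer {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    // Called on the request path: enqueue and return immediately,
    // without touching the database.
    void accept(String message) {
        queue.add(message);
    }

    // Called by a background writer: removes up to maxBatch queued
    // messages, in arrival order, ready to be written in one batch.
    List<String> nextBatch(int maxBatch) {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch, maxBatch);
        return batch;
    }

    public static void main(String[] args) {
        MessageBuffer buffer = new MessageBuffer();
        for (int i = 1; i <= 5; i++) {
            buffer.accept("msg-" + i);
        }
        System.out.println(buffer.nextBatch(3)); // [msg-1, msg-2, msg-3]
        System.out.println(buffer.nextBatch(3)); // [msg-4, msg-5]
    }
}
```

Adding more writer threads or machines that each call `nextBatch` is what makes the processing side horizontally scalable.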
In terms of data storage, MySQL is a relational database, but it doesn't sound like you are really doing any relational data processing, rather just storing a large amount of data. I would suggest looking into a NoSQL database (as others have suggested here as well) such as MongoDB, Cassandra, CouchDB, etc. They each have their strengths and weaknesses (you can read more about each of them on their respective websites and elsewhere on the internet).
I guess, typical access would involve retrieving all text of one chat session at least.
The number of rows is large and your data is not very relational. This is a good fit for a non-relational database.
If you still want to go with MySQL, use partitions. While writing, use batch inserts, and while reading, provide sufficient partition pruning hints in your queries. Use EXPLAIN PARTITIONS to check whether partitions are being pruned. In this case I would strongly recommend that you combine the chat lines of one chat session into a single row. This will dramatically reduce the number of rows compared to one chat line per row.
You didn't mention how many days of data you want to store.
On a separate note: how successful would your app have to be in terms of users to require 200k messages per second? An active chat session may generate about 1 message every 5 seconds from one user. For ease of calculation, let's make it 1 second. So you are building capacity for 200K online users, which implies you would have at least a few million users.
It is good to think of scale early. However, it requires engineering effort. And since resources are limited, allocate them carefully for each task (Performance/UX etc). Spending more time on UX, for example, may yield a better ROI. When you get to multi-million user territory, new doors will open. You might be funded by an Angel or VC. Think of it as a good problem to have.
My 2 cents.
I have an Android client that deals with product items, and I would like to create an interface for displaying the most popular product items at any given time.
I have read and used shard counters to achieve highly scalable and parallel counting. This has been working well as far as counting is concerned.
However, the problem starts when it comes time to calculate the top 10 most popular product items for a single request: I have to fetch all the product entities first, fetch the shard counters of each and add them up, and then finally sort them to get the most popular ones.
The problem here is that in order to find out what's the most popular, I have to recalculate all the shard counters. Multiply that by 10,000 product items and my request for a single user becomes slow as hell.
I've thought about using a cron job to calculate the result and store that instead. Would you recommend going that way? Has anyone else dealt with a similar situation?
Thanks!
Either regularly aggregate the counters into a single read-only value, as you suggest, or use an alternate way to keep high-concurrency counters, like this.
If you go with the former approach, you probably want to use a mapreduce triggered from a cronjob.
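A sketch of the first approach, with the aggregation done off the request path. The class `PopularitySnapshot` is hypothetical and keeps everything in memory; in the setup described above, `refresh` would be driven by the cron-triggered job and the result stored in the datastore or memcache.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class PopularitySnapshot {
    // Last computed result; user requests read this instead of
    // summing shards themselves.
    private volatile List<String> cachedTop = Collections.emptyList();

    // Intended to be called by the periodic job, not by user requests.
    // Keys are product IDs, values are that product's shard counts.
    void refresh(Map<String, long[]> shardsByProduct, int topN) {
        Map<String, Long> totals = new HashMap<>();
        for (Map.Entry<String, long[]> e : shardsByProduct.entrySet()) {
            long sum = 0;
            for (long shard : e.getValue()) {
                sum += shard;
            }
            totals.put(e.getKey(), sum);
        }
        cachedTop = totals.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(topN)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    // Cheap read used by every request: no shard access at all.
    List<String> top() {
        return cachedTop;
    }
}
```

The trade-off is staleness: the list is only as fresh as the last run of the job.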
I've got a design question regarding Google's database Cloud Datastore. Let me explain it by using an example:
I've got Entities of the kind "Article" with the following properties:
title
userId
....
sumOfScore
SumOfScore should be the sum of all related "Score" entities, which have
properties like:
articleId
userId
score
In Pseudo-SQL:
sumOfScore = select sum(score) from Score where score.articleId = article.id
I see two possibilities to design this (using Google's datastore API):
1.) No property sumOfScore for Articles; but query always:
This means: Every time an article is read, I need to do a query for this specific article to calculate the sumOfScore.
Imagine a list of 100 Articles that is shown to a user. This would need additional 100 queries to the database, just to show the score for each article.
Nevertheless: This would be my preferred way when using a Relational-DB. No redundancy and good normalization.
And with SQL you can use just one join-select to catch all data.
But it doesn't feel right for Cloud Datastore.
2.) Calculate the sumOfScore whenever Score entities are changed:
This means: Whenever a Score-Entity is added, removed or changed, the related Article
updates the sumOfScore property.
Advantage: When reading articles no additional queries are needed. The sumOfScore is redundant on the entity itself.
Disadvantage: Every time a score is changed, there is one additional query and an additional write (updating an Article entity). And sumOfScore may mismatch with the actual Score entities (e.g. value is changed via DB-Console)
What do more experienced people think? Is there a common best practice for such a scenario?
What do the JPA or JDO implementations do under the hood?
Thanks a lot
Mos
The first thing I recommend is that you look into the GAE article about sharding counters.
That is an article from the GAE best practices relating to how you should be handling counters/sums. It can be a little tricky because every time you update an element you have to use logic to randomly pick a sharded counter; and when you retrieve your count you're actually fetching a group of entities and summing them. I've gone this route but won't provide code here on how I did it because I haven't battle tested it yet. But your code can get sloppy in a hurry if you just copy/paste the sample sharding code all over the place, so make an abstract or typed counter class to reuse your sharding logic if you decide to go this route.
Another alternative would be to use a fuzzy count. This method uses memcache and offers better performance at the cost of accuracy.
See the section here labeled "Transient and frequently updated data"
And the last alternative is to just use SQL. It's experimental and hot out of the oven (in relation to being used on GAE), but it might be worth looking into.
There's a third possibility which doesn't make a compromise.
You make Score a child of Article, and keep the sumOfScore in Article. For sorting purposes, this field will come in handy. As these two classes are in the same entity group, you can create a Score and update the Article in a transaction. You could even double check by querying all the Score entities whose parent is a given Article.
The problem with this approach is that you can only update an entity 5 times per second. If you think you'll have much more activity than that (remember, it's just a limitation on a single entity, not the entire table), you should check out the sharded counter tutorial or see the Google I/O video explaining this.
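The parent/child approach can be sketched as follows. This is an in-memory illustration with a hypothetical class `ArticleGroup`: the `synchronized` methods stand in for the entity-group transaction, and the list stands in for the child Score entities.

```java
import java.util.ArrayList;
import java.util.List;

class ArticleGroup {
    // Child Score entities of this Article.
    private final List<Integer> scores = new ArrayList<>();
    // Denormalized total kept on the Article itself.
    private long sumOfScore;

    // Creates the child and updates the sum in one atomic step, as a
    // transaction on the entity group would.
    synchronized void addScore(int score) {
        scores.add(score);
        sumOfScore += score;
    }

    // Cheap read: no query over the children is needed.
    synchronized long sumOfScore() {
        return sumOfScore;
    }

    // The "double check": recompute from the children and compare.
    synchronized boolean isConsistent() {
        long recomputed = 0;
        for (int s : scores) {
            recomputed += s;
        }
        return recomputed == sumOfScore;
    }

    public static void main(String[] args) {
        ArticleGroup article = new ArticleGroup();
        article.addScore(3);
        article.addScore(5);
        System.out.println(article.sumOfScore());   // 8
        System.out.println(article.isConsistent()); // true
    }
}
```

Because every write goes through the same group, writers serialize on it, which is the per-entity update limit mentioned above.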
edit:
Here's a great discussion about this same topic: How does Google Moderator avoid contention?