Counting Unique Users using Mapreduce for Java Appengine

Counting Unique Users using Mapreduce for Java Appengine - java

I'm trying to count the number of unique users per day on my java appengine app. I have decided to use the mapreduce framework (mapreduce.appspot.com) for java appengine to do this calculation offline. I've managed to create a map reduce job that goes through all of my entities which represent a single users session event. I can use a simple counter as well. I have several questions though:
1) How do I only increment a counter once for each user id? I am currently mapping over entities which contain a user id property but many of these entities may contain the same user id so how do I only count it once?
2) Once I have these results of the job stored in these counters - how can I persist them to the datastore? I see the results of the counters on the mapreduce's status page but I want these results automatically persisted to the datastore.
Ideas?

I haven't actually used the MapReduce functionality yet, but my theoretical understanding is that you can write things to the datastore from within your mapper. You could create an Entity type called something like UniqueCount, and insert one entity every time your mapper sees an ID that it hasn't seen before. then you can count how many unique ID's you have. In fact, you can just update a counter every time you find a new unique entity. You may want to google "sharded counter" for hints on creating a counter in the datastore that can handle high throughput.
Eventually, when they finish the Reduce functionality, I imagine this whole task will become pretty trivial.

Related

GAE Long ID's too long is there a way to get shorter Long ID's?

During localhost development the ID's generated by GAE, starts with 1.
However in a real GAE deployment in the cloud, the ID generated even for the firsts entities are quite long like, 5639412304721232, is there a work around to make the first entities to start with 1, 2, 3.. and so on?
One might suggest to use Sharded Counters, and yes I've used this, however some suggests that sharded counters are not to be used as app might get the same count as it is eventually consistent.
In this case what could be the best solution?

The official post explaining the switch from sequential to 'scattered' ids is here.
The instructions for reverting to sequential behaviour are here, but note the warning that this option will eventually be removed.
The 'best' solution depends on what you need and why. You'll get better datastore performance with scattered ids, but honestly, you might not notice much difference if your app makes gets a small number of requests and makes light use of the datastore. If that's the case, you can use roll your own sequential ids based on a simple entity with a property that holds the the current high watermark id, and rely on having a low transaction rate to keep you from running into limits on the number of transactions per entity.
Reliably handing out sequential ids without gaps in a distributed systems is challenging.

Be aware that you may run into problems if you create a lot of entities very quickly, with sequential Long IDs. This post gives you an explanation why.
In theory there's a choice of auto ID generation policies, with scattered IDs being the default since 1.8.1, but the old monotonically increasing legacy policy is to be deprecated for the reasons discussed in the linked post.
If you're using a sharded counter, you will avoid this but, as you say, you may encounter other issues.

You might try using allocate_ds. We use this to get smaller integer values for system generated ids. In Python using a db kind:
model_key = db.Key.from_path('your_kind_name', 1)
key_batch = db.allocate_ids(model_key, 1)
id_new = key_batch[0]
idkey = db.Key.from_path('your_kind_name', id_new)

I would assign the key's identifier as the strings "1", "2", "3"... and so on, generating them from a sequencer. You can check to see if the entity already exists with a get_or_insert() function.
Similarly, you can use the auto-increment solution by storing the sequence number in an entity.

How to find the value that has maximum duplicates in app engine datastore using Java?

I have a datastore that stores the cab booking details of the customers. In the admin console I need to display the statistics to the admin, like busiest location, peak hours, total bookings in a particular location in a particular day. For the busiest location i need to retrieve the location from where most number of cabs has been booked. Should I iterate through the entire datastore and keep a count or is there any method to know which location has maximum and minimum duplicates.
I am using a ajax call to java servlet which should return the busiest location.
And I also need a suggestion for maintaining such a stats page. Should I keep a separate Entity kind just for counters and stats and update it everytime when a customer books a cab or is the logic correct for iterating through the entire datastore for the stats page. Thanks in advance.

There are too many unknowns about your data model and usage patterns to offer a specific solution, but I can offer a few tips.
Updating a counter every time you create a new record will increase your writing costs by 2 write operations, which may or may not be significant.
Using keys-only queries is very cheap and fast. It is the preferred method for counting something, so you should try to model your data in such a way that a keys-only query can give you an answer. For example, if a "trip" entity has a property for "id of a starting point", and this property is indexed, you can loop through your locations using a keys-only query to count the number of trips that started from each location.
Assuming that you record a lot of trips, and that an admin page will be visited/refreshed not very frequently, the keys-only queries approach is the way to go. If the admin page is visited/refreshed many times per hour, you may be better off with the counters.

Java - Google App Engine - modelling graph structures in Google Datastore

Google Apps Engine offers the Google Datastore as the only NoSQL database (I think it is based on BigTable).
In my application I have a social-like data structure and I want to model it as I would do in a graph database. My application must save heterogeneous objects (users,files,...) and relationships among them (such as user1 OWNS file2, user2 FOLLOWS user3, and so on).
I'm looking for a good way to model this typical situation, and I thought to two families of solutions:
List-based solutions: Any object contains a list of other related objects and the object presence in the list is itself the relationship (as Google said in the JDO part https://developers.google.com/appengine/docs/java/datastore/jdo/relationships).
Graph-based solution: Both nodes and relationships are objects. The objects exist independently from the relationships while each relationship contain a reference to the two (or more) connected objects.
What are strong and weak points of these two approaches?
About approach 1: This is the simpler approach one can think of, and it is also presented in the official documentation but:
Each directed relationship make the object record grow: are there any limitations on the number of the possible relationships given for instance by the object dimension limit?
Is that a JDO feature or also the datastore structure allows that approach to be naturally implemented?
The relationship search time will increase with the list, is this solution suitable for large (million) of relationships?
About approach 2: Each relationship can have a higher level of characterization (it is an object and it can have properties). And I think memory size is not a Google problem, but:
Each relationship requires its own record, so the search time for each related couple will increase as the total number of relationships increase. Is this suitable for large amount of relationships(millions, billions)? I.e. does Google have good tricks to search among records if they are well structured? Or I will be soon in a situation in which if I want to search a friend of User1 called User4 I have to wait seconds?
On the other side each object doesn't increase in dimension as new relationships are added.
Could you help me to find other important points on the two approaches in such a way to chose the best model?

First, the search time in the Datastore does not depend on the number of entities that you store, only on the number of entities that you retrieve. Therefore, if you need to find one relationship object out of a billion, it will take the same time as if you had just one object.
Second, the list approach has a serious limitation called "exploding indexes". You will have to index the property that contains a list to make it searchable. If you ever use a query that references more than just this property, you will run into this issue - google it to understand the implications.
Third, the list approach is much more expensive. Every time you add a new relationship, you will rewrite the entire entity at considerable writing cost. The reading costs will be higher too if you cannot use keys-only queries. With the object approach you can use keys-only queries to find relationships, and such queries are now free.
UPDATE:
If your relationships are directed, you may consider making Relationship entities children of User entities, and using an Object id as an id for a Relationship entity as well. Then your Relationship entity will have no properties at all, which is probably the most cost-efficient solution. You will be able to retrieve all objects owned by a user using keys-only ancestor queries.

I have an AppEngine application and I use both approaches. Which is better depends on two things: the practical limits of how many relationships there can be and how often the relationships change.
NOTE 1: My answer is based on experience with Objectify and heavy use of caching. Mileage may vary with other approaches.
NOTE 2: I've used the term 'id' instead of the proper DataStore term 'name' here. Name would have been confusing and id matches objectify terms better.
Consider users linked to the schools they've attended and vice versa. In this case, you would do both. Link the users to schools with a variation of the 'List' method. Store the list of school ids the user attended as a UserSchoolLinks entity with a different type/kind but with the same id as the user. For example, if the user's id = '6h30n' store a UserSchoolLinks object with id '6h30n'. Load this single entity by key lookup any time you need to get the list of schools for a user.
However, do not do the reverse for the users that attended a school. For that relationship, insert a link entity. Use a combination of the school's id and the user's id for the id of the link entity. Store both id's in the entity as separate properties. For example, the SchoolUserLink for user '6h30n' attending school 'g3g0a3' gets id 'g3g0a3~6h30n' and contains the fields: school=g3g0a3 and user=6h30n. Use a query on the school property to get all the SchoolUserLinks for a school.
Here's why:
Users will see their schools frequently but change them rarely. Using this approach, the user's schools will be cached and won't have to be fetched every time they hit their profile.
Since you will be getting the user's schools via a key lookup, you won't be using a query. Therefore, you won't have to deal with eventual consistency for the user's schools.
Schools may have many users that attended them. By storing this relationship as link entities, we avoid creating a huge single object.
The users that attended a school will change a lot. This way we don't have to write a single, large entity frequently.
By using the id of the User entity as the id for the UserSchoolLinks entity we can fetch the links knowing just the id of the user.
By combining the school id and the user id as the id for the SchoolUser link. We can do a key lookup to see if a user and school are linked. Once again, no need to worry about eventual consistency for that.
By including the user id as a property of the SchoolUserLink we don't need to parse the SchoolUserLink object to get the id of the user. We can also use this field to check consistency between both directions and have a fallback in case somehow people are attending hundreds of schools.
Downsides:
1. This approach violates the DRY principle. Seems like the least of evils here.
2. We still have to use a query to get the users who attended a school. That means dealing with eventual consistency.
Don't forget Update the UserSchoolLinks entity and add/remove the SchoolUserLink entity in a transaction.

You question is too complex but I try explain the best solution (I will answer in Python but same can be done in Java).
class User(db.User):
followers = db.StringListProperty()
Simple add follower.
user = User.get(key)
user.followers.append(str(followerKey))
This allow fast query who is followed and followers
User.all().filter('followers', followerKey) # -> followed
This query i/o costly so you can make it faster but more complicated and costly in i/o writes:
class User(db.User):
followers = db.StringListProperty()
follows = db.StringListProperty()
Whatever this is complicated during changes since delete of Users need update follows so you need 2 writes.
You can also store relationships but it is the worse scenario since it is more complex than second example with followers and follows ... - keep in mind than entity can have 1Mb it is not limit but can be.

Design for Users-Challenges Relationship in Java/Android

I am just starting off with app development and am currently writing an Android application which has registered users and a list of 'challenges' which they are able to select and later mark as completed/failed.
The plan is to eventually store all users/challenge/etc data on a database though I haven't implemented this yet.
The issue I have run in to is this - in my current design each User has list variables containing their current challenges and completed challenges eg. two ArrayList fields.
Users currently select challenges from a listview of different Challenge objects, which are then added to the user's CurrentChallenges list.
What I had not accounted for is how to structure this so that when a user takes on a challenge, they have their own unique copy of that challenge that can be independently marked as completed etc, whereas at the minute every user that selects say, Challenge 1, is simply adding the same challenge with the same ID etc. as each other user that selects Challenge 1.
I supposed I could have each different challenge be its own sub-class of Challenge and assign every user which selects that challenge type a different instance of that class, however this seems like it would be a very messy/inefficient method as all the different classes would be largely the same.
Does anyone have any good ideas or design patterns for this case? Preferably a solution that will be compatible with later storing these challenges in a database and presumably using ORM.
Thanks a lot for any suggestions,
E

I'd move every aspect of a challenge that is different for each user into a new Attempt class. So Challenge might have variables for name, description etc. and Attempt would have inProgress, completed etc. Obviously these are just examples, replace them with whatever data you're actually storing.
Now in your User class, you can record challenges using a Map. Make it a Map<Challenge, Attempt> and each User will be able to store an Attempt for each Challenge to record their progress. The Challenge instances are shared between users but there is an Attempt instance for each combination of User and Challenge.
When you implement the database later, Challenge, User and Attempt would each translate to a table. Attempt would have foreign keys for both of the other tables. Unfortunately I haven't used ORMs much so I'm not sure whether they'd work with a Map correctly.

How to efficiently store multiple different counter values on a user in a MySQL based application?

I want to store different kinds of counters for my user.
Platform: Java
E.g. I have identified:
currentNumRecords
currentNumSteps
currentNumFlowsInterval1440
currentNumFlowsInterval720
currentNumFlowsInterval240
currentNumFlowsInterval60
currentNumFlowsInterval30
etc.
Each of the counters above needs to be reset at the beginning of each month for each user. The value of each counter can be unpredictably high with peaks etc. (I mean that a lot of things are counted, so I want to think about a scalable solution).
Now my question is what approach to take to:
a) Should I have separate columns for each counter on the user table and doing things like 'Update set counterColumn = counterColumn+ 1' ?
b) put all the values in some kind of JSON/XML and put it in a single column? (in this case I always have to update all values at once)
The disadvantage I see is row locking on the user table everytime a single counter is incremented.
c) having an separate counter table with 3 columns (userid, name, counter) and doing one INSERT for each count + having a background job doing aggregates which are written to the User table? In this case would it be ok to store the aggregated counters as JSON inside a column in the user table?
d) Doing everything in MySQL or also use another technology? I also thought about using another solution for storing counters and only keeping the aggregates in MySQL. E.g. I have experimented with Apache Cassandra's distributed counters. My concerns are about the Transactions which cassandra does not have.
I need the counters to be exact because they are used for billing, thus I don't know if Cassandra is a good fit here, although the scalability of Cassandra seems tempting.
What about Redis for storing the counters + writing the aggregates in MySQL? Does Redis have stuff which helps me here? Or should I just store everything in a simple Java HashMap in-memory and have a aggregation background thread and don't use another technology?
In summary I am concerned about:
reduce row locking
have exact counters (transactions?)
Thanks for your ideas :)

You're sort of saying contradictory things.
The number of counts can be huge or at least unpredictable per user.
To me this means they must be uniform, like an array. It is not possible to have an unbounded number of heterogenous data, unless you have an unbounded amount of code and an unbounded number of developer hours to expend.
If they are uniform they should be flattened into a table user_counter where each row is of the form (user_id, counter_name, counter_value). However you will need to think carefully about what sort of indices you will need, etc. Updating at the beginning of the month if they are all set to zero or some default value is one SQL query.
Basically (c). (a) and (b) are most absurd and MySQL is still a suitable technology for this.

Your requirement is not so untypical. In general this is statistical session/user/... bound written data.
The first thing is to split things if not already done so. Make a mostly readonly database, and separately collect these data. So a separated user table for the normal properties.
The statistical data could be held in an in-memory table. You could also use means other than a database, a message queue, session attributes.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.