I built a social Android application in which users can see other users around them by GPS location. At the beginning things went well, as I had a low number of users, but now that the number of users is growing (about 1500, plus 100 every day) I have discovered a major problem in my design.
In my Google App Engine servlet I have a static HashMap that holds all the user profile objects, currently 1500, and this number will increase as more users register.
Why I'm doing it
Every user that requests the users around him has his GPS position compared with the other users' positions to check if they are within his 10 km radius; this happens every 5 minutes on average.
That is why I can't fetch the users from the DB every time: the GAE read/write operation quota would tear me apart.
The problem with this design is
As the number of users has increased, the HashMap turns to null every 4-6 hours; I think this interval is getting shorter, but I'm not sure.
I'm fixing this by reloading the users from the DB every time I detect it has become null, but this causes a 30-second denial of service for my users, so I'm looking for a better solution.
I'm guessing that it happens because of the size of the HashMap. Am I right?
I would like to know how to manage ALL the user profiles with maximum availability.
Thanks.
I would not store this data in a HashMap, as it does not really scale if you run on multiple instances, and furthermore it uses a lot of memory.
Why not use a different data store such as MongoDB, which is also available 'in the cloud' (e.g. www.mongohq.com)?
If you want to scale, you need to separate the data from the processors: e.g. have x servers running your servlet (or let Google App Engine scale this by itself) and keep the data in a different place (e.g. in MongoDB or PostgreSQL).
You need to rethink your whole design. Storing all users in one huge HashMap won't scale (sooner or later you'll have to cluster your application). The complexity of your algorithm is also quite high: you traverse the whole map for every requesting user, so the total work grows quadratically with the number of users.
A much more scalable solution is a spatial database. All major relational databases and some NoSQL products offer geospatial indexing: the query engine is optimized for queries like "give me all the records near this given point".
If your application is really successful, even an in-memory map will be slower than an enterprise-grade geospatial index.
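For illustration, here is a minimal sketch of such a query using MongoDB's 2dsphere index and the Java driver. The database, collection and field names ("socialapp", "profiles", "location") are assumptions for the example, not anything from the question:

import com.mongodb.*;

public class NearbyUsers {
    public static void main(String[] args) throws Exception {
        DB db = new MongoClient("localhost").getDB("socialapp"); // hypothetical names
        DBCollection profiles = db.getCollection("profiles");
        // index the GeoJSON "location" field once
        profiles.createIndex(new BasicDBObject("location", "2dsphere"));

        double lon = 34.78, lat = 32.08; // the requesting user's position
        DBObject point = new BasicDBObject("type", "Point")
                .append("coordinates", new double[] { lon, lat });
        DBObject near = new BasicDBObject("$geometry", point)
                .append("$maxDistance", 10000); // 10 km, in metres
        DBCursor nearby = profiles.find(new BasicDBObject("location",
                new BasicDBObject("$nearSphere", near)));
        while (nearby.hasNext()) {
            System.out.println(nearby.next());
        }
    }
}

The database does the radius filtering via the index, so no per-user scan of all profiles is needed.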
Related
I have a datastore that stores the cab booking details of customers. In the admin console I need to display statistics to the admin, like the busiest location, peak hours, and total bookings at a particular location on a particular day. For the busiest location I need to retrieve the location from which the greatest number of cabs has been booked. Should I iterate through the entire datastore and keep a count, or is there a method to find which location has the maximum and minimum number of duplicates?
I am using an Ajax call to a Java servlet which should return the busiest location.
I also need a suggestion for maintaining such a stats page. Should I keep a separate entity kind just for counters and stats and update it every time a customer books a cab, or is the logic of iterating through the entire datastore correct for the stats page? Thanks in advance.
There are too many unknowns about your data model and usage patterns to offer a specific solution, but I can offer a few tips.
Updating a counter every time you create a new record will increase your writing costs by 2 write operations, which may or may not be significant.
Using keys-only queries is very cheap and fast. It is the preferred method for counting something, so you should try to model your data in such a way that a keys-only query can give you an answer. For example, if a "trip" entity has a property for "id of a starting point", and this property is indexed, you can loop through your locations using a keys-only query to count the number of trips that started from each location.
Assuming that you record a lot of trips and that the admin page is not visited/refreshed very frequently, the keys-only query approach is the way to go. If the admin page is visited/refreshed many times per hour, you may be better off with the counters.
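As a minimal sketch of the keys-only approach with the low-level Datastore API (the "Trip" kind and "startLocationId" property come from the example above; everything else is assumed):

import com.google.appengine.api.datastore.*;

// count trips that started at one location without fetching the full entities
int countTripsFrom(String locationId) {
    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
    Query q = new Query("Trip")
            .setFilter(new Query.FilterPredicate("startLocationId",
                    Query.FilterOperator.EQUAL, locationId))
            .setKeysOnly();
    return ds.prepare(q).countEntities(FetchOptions.Builder.withLimit(100000));
}

Calling this in a loop over your known locations and taking the maximum gives you the busiest location.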
Just FYI, this question is not exactly about MongoDB, but happens to use MongoDB. I am assuming we might end up using MongoDB features such as sharding in a good design, hence the mention of MongoDB. Also, FWIW, we use Java.
So we have around 100 million records in a certain collection, of which we need to select all the items that have some date set to tomorrow. Usually this query returns 10 million records.
You can assume that we have N (say ten) machines at hand, and that MongoDB is sharded based on record_id.
Each record that we will process is independent of the other records we are reading. No records will be written as part of this batch job.
What I am looking for is:
1. No centralized workload distribution across the different machines.
2. Fair, or almost fair, workload distribution.
3. Fault tolerance (if one of the batch machines goes down, we want another machine to take over its load). I'm not sure this requirement can be fulfilled without compromising requirement 1.
Any good solution which has already worked in a similar situation?
I can speak in the context of MongoDB.
Requirements 1 and 2 are addressed through sharding. I'm not sure I follow your question, though, as it sounds like 1 says you don't want to centralize the workload and 2 says you want to distribute the workload evenly.
In any case, with the proper shard key, you will distribute your workload across your shards. http://docs.mongodb.org/manual/sharding/
Requirement 3 is handled via replica sets in MongoDB. http://docs.mongodb.org/manual/replication/
I would have to understand your application and use case better to know for certain, but pulling 10M records out of a 100M-record collection as your typical access pattern doesn't sound like the right document model is in place. Keep in mind that collection <> table and document <> record. I would look into storing your data at a higher logical granularity so you pull fewer records; this will significantly improve performance.
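To make requirements 1 and 2 concrete, here is a rough sketch of how each worker could pull only its own share of tomorrow's records using a $mod filter. The field names "dueDate" and "record_id", the host name, and the date representation are all assumptions:

import com.mongodb.*;

public class BatchWorker {
    public static void main(String[] args) throws Exception {
        int totalWorkers = Integer.parseInt(args[0]); // N machines
        int workerIndex = Integer.parseInt(args[1]);  // 0 .. N-1
        DBCollection records = new MongoClient("mongos-host") // hypothetical host
                .getDB("batch").getCollection("records");

        // each worker asks only for its own share of tomorrow's records
        DBObject query = new BasicDBObject("dueDate", "2013-06-19") // "tomorrow"
                .append("record_id", new BasicDBObject("$mod",
                        new int[] { totalWorkers, workerIndex }));
        DBCursor cursor = records.find(query);
        while (cursor.hasNext()) {
            process(cursor.next()); // records are independent, per the question
        }
    }

    static void process(DBObject record) { /* application logic goes here */ }
}

Since the split is purely a function of record_id, there is no central coordinator (requirement 1) and the shares are close to even (requirement 2); if a machine dies, its remainder class can simply be re-run on another machine, which gives a crude form of requirement 3 as well.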
Need advice on caching and paging. The scenario goes like this:
The user gives two variable ranges (say variable X from x1 to x2 and Y from y1 to y2) and I fetch the data from the database; after that, some logic orders the result and gives the first page back to the user.
For every user these ranges (X & Y) are different.
The problem starts when the user asks for the second page: I have to fire the query and order the result again to produce the second page.
This has to be done for each user request.
Can you suggest any caching strategies for this?
(Java + MySQL)
If I am not clear, do let me know...
If you use Hibernate for DB access you can enable second-level caching to get EhCache caching your results automatically. Read more on this in the Hibernate documentation.
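A minimal sketch of the relevant settings, assuming Hibernate 4 with the EhCache region factory (property names taken from the Hibernate documentation; adjust to your version):

import org.hibernate.cfg.Configuration;

// enable the second-level and query caches with EhCache as the region factory
Configuration cfg = new Configuration()
        .setProperty("hibernate.cache.use_second_level_cache", "true")
        .setProperty("hibernate.cache.use_query_cache", "true")
        .setProperty("hibernate.cache.region.factory_class",
                "org.hibernate.cache.ehcache.EhCacheRegionFactory");
// queries must also opt in individually:
// session.createQuery("from Result r where ...").setCacheable(true);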
I assume that a user can make a request for e.g. x1 to x100 and get 100 results, and you want to display the results paginated, with e.g. 10 results at a time.
I have three suggestions, each with various merits.
Rely on MySQL to do the caching.
MySQL is pretty good at caching. If you repeat the same query a number of times, MySQL will try to cache the results for you, making subsequent queries very fast (they come from memory and don't touch disk). If you query x1 to x100 and just display x1 to x10, then when you want to display page 2 and issue the x1 to x100 query again, MySQL can serve it from its own internal cache. MySQL caching obviously uses RAM, so you need to ensure your DB server has enough RAM to cache effectively. You will have to estimate how much data you expect to be caching and see if this is feasible.
This is an easy solution, and hardware (RAM) is fairly cheap*.
Cache internally
Save the whole x1 to x100 result set in your application (e.g. in a user session) and serve the relevant slice for each page requested. This is similar to using MySQL to cache, but you are moving the cache closer to where it is needed. If you do this you have to handle cache management yourself (e.g. when to expire the cache, and managing memory use). You can use existing caching tools like EhCache for this.
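A minimal sketch of the session variant inside a servlet's request handler; Result, fetchRange, page and PAGE_SIZE are placeholders for your own types and logic, and a real implementation might delegate to EhCache instead of the raw session:

// cache the full result list per user, then slice out one page per request
@SuppressWarnings("unchecked")
List<Result> results = (List<Result>) session.getAttribute("results");
if (results == null) {
    results = fetchRange(x1, x2, y1, y2); // the single expensive MySQL query
    session.setAttribute("results", results);
}
int from = (page - 1) * PAGE_SIZE;
List<Result> pageItems = results.subList(from,
        Math.min(from + PAGE_SIZE, results.size()));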
Prefetch instead of caching
Caching requires a lot of memory. If caching isn't feasible, you might want to consider a prefetch strategy. With no caching you might experience slow response times, but the problem is somewhat alleviated with a prefetch: after a user has viewed one page, you do an async lookup of the other pages the user is likely to view (usually the next and previous ones). The lookup may be slow, but you are doing it before the user has asked for it, so it's ready immediately when they need it.
Prefetching is a fairly common technique in web applications: when a user views a page, the application can use Ajax to load the previous and next pages into the DOM behind the scenes while the user is viewing the current one. When the user clicks the 'next' link, the application modifies the DOM to show the next page without touching the server at all. To the user the response looks instant.
*I was once on a MySQL administration course, and on the topic of performance the tutor said the first thing anyone should do is fit as much RAM into the server as it supports. Most servers can be filled with RAM (we're talking hundreds of GB) for less than the cost of a 3-day performance tuning course.
I am a Java developer. I want to know the best way to store huge amounts of data in MySQL using Java.
Huge: two hundred thousand chat messages every second.
An index is not needed here.
Should I store the messages in the database as soon as the user creates them? Will it be too slow?
1 billion writes/day is about 12k/second. Assuming each message is about 16 bytes, that's about 200 KB/sec. If you don't care about reading, you can easily write this to disk at that rate, maybe one message per line. Your read access pattern is probably going to dictate what you end up needing to do here.
If you use MySQL, I'd suggest combining multiple messages per row, if possible. Partitioning the table would be helpful to keep the working set in memory, and you'll want to commit a number of records per transaction, maybe 1000 rows. You'll need to do some testing and tuning, and this page will be helpful:
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
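As a sketch of the batched-commit idea over JDBC (the table and column names and the Message type are made up; rewriteBatchedStatements is a Connector/J option that collapses the batch into multi-row INSERTs):

import java.sql.*;

// inside a worker: flush a batch of pending messages, committing every 1000 rows
Connection conn = DriverManager.getConnection(
        "jdbc:mysql://localhost/chat?rewriteBatchedStatements=true", "user", "pass");
conn.setAutoCommit(false);
PreparedStatement ps = conn.prepareStatement(
        "INSERT INTO messages (user_id, body, created_at) VALUES (?, ?, NOW())");
int n = 0;
for (Message m : pending) { // 'pending' is a hypothetical buffer of messages
    ps.setLong(1, m.userId);
    ps.setString(2, m.body);
    ps.addBatch();
    if (++n % 1000 == 0) { // commit every 1000 rows, as suggested above
        ps.executeBatch();
        conn.commit();
    }
}
ps.executeBatch(); // flush the remainder
conn.commit();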
You should probably also look at Cassandra, which is written with heavy write workloads in mind.
My suggestion is also MongoDB, since the NoSQL paradigm fits your needs perfectly.
Below is a flavor of MongoDB in Java:
import com.mongodb.*;

// connect and pick a database and collection
DB db = new MongoClient("localhost", 27017).getDB("mkyongDB");
DBCollection collection = db.getCollection("hosting");

// build a document with a nested sub-document for the details
BasicDBObject document = new BasicDBObject();
document.put("database", "mkyongDB");
document.put("table", "hosting");

BasicDBObject documentDetail = new BasicDBObject();
documentDetail.put("records", "99");
documentDetail.put("index", "vps_index1");
documentDetail.put("active", "true");

document.put("detail", documentDetail);
collection.insert(document);
This tutorial is good for getting started. You can download MongoDB from GitHub.
For optimization of MongoDB, please refer to this post.
Do you absolutely have to use MySQL, or are you open to other DBs as well? MongoDB or CouchDB would be a good fit for this kind of need. Check them out if you are open to other DB options.
If you have to go with MySQL, then we have done something similar: all the related text messages go into a child record as a single JSON value. We append to it each time, and we keep the master in a separate table. So there is one master and one child record at minimum, and more child records as the messages grow beyond a certain number (30 in our scenario); the "load more..." style query then fetches the second child record, which holds the next 30.
Hope this helps.
FYI, we are migrating to CouchDB for some other reasons and needs.
There are at least two different parts to this problem:
Processing the messages for storage in the database
What type of storage to use for the messages
For processing the messages, you're likely going to need a horizontally scalable system (meaning you can add more machines to process the messages quickly) so you don't accumulate a huge backlog. You should definitely not try to write these messages synchronously; rather, when a message is received, put it on a queue to be processed for writing to the database (something like JMS comes to mind here).
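For example, the receiving side could do no more than enqueue the message and return; a sketch against the plain JMS API, where the connection setup, the messageBody variable and the queue name "chat.incoming" are assumptions:

import javax.jms.*;

// producer side, e.g. in the request handler: enqueue and return immediately;
// a separate pool of consumers drains the queue and writes to the database
Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
MessageProducer producer = session.createProducer(session.createQueue("chat.incoming"));
producer.send(session.createTextMessage(messageBody));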
In terms of data storage, MySQL is a relational database, but it doesn't sound like you are really doing any relational data processing, rather just storing a large amount of data. I would suggest looking into a NoSQL database (as others have suggested here as well) such as MongoDB, Cassandra, CouchDB, etc. They each have their strengths and weaknesses (you can read more about each of them on their respective websites and elsewhere on the internet).
I guess typical access would involve retrieving at least all the text of one chat session.
The number of rows is large and your data is not very relational. This is a good fit for a non-relational database.
If you still want to go with MySQL, use partitions. While writing, use batch inserts, and while reading, provide sufficient partition pruning hints in your queries. Use EXPLAIN PARTITIONS to check whether partitions are being pruned, as in the sketch below. In this case I would strongly recommend combining the chat lines of one chat session into a single row; this will dramatically reduce the number of rows compared to one chat line per row.
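For example (the schema is entirely hypothetical; MySQL 5.5+ partitioning syntax):

import java.sql.*;

Statement stmt = conn.createStatement();
// one row per chat session per day, partitioned by day
stmt.execute("CREATE TABLE chat_sessions ("
        + " session_id BIGINT, day DATE, messages MEDIUMTEXT,"
        + " PRIMARY KEY (session_id, day))"
        + " PARTITION BY RANGE COLUMNS (day) ("
        + "  PARTITION p130101 VALUES LESS THAN ('2013-01-02'),"
        + "  PARTITION p130102 VALUES LESS THAN ('2013-01-03'))");
// a WHERE filter on day lets the optimizer prune to a single partition;
// EXPLAIN PARTITIONS shows which partitions the query would actually touch
ResultSet rs = stmt.executeQuery("EXPLAIN PARTITIONS"
        + " SELECT messages FROM chat_sessions WHERE day = '2013-01-01'");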
You didn't mention how many days of data you want to store.
On a separate note: how successful would your app have to be, in terms of users, to require 200k messages per second? An active chat session may generate about 1 message every 5 seconds from one user; for ease of calculation let's make it 1 per second. So you are building capacity for 200K concurrently chatting users, which implies at least a few million users overall.
It is good to think about scale early. However, it requires engineering effort, and since resources are limited, allocate them carefully across tasks (performance, UX, etc.). Spending more time on UX, for example, may yield a better ROI. When you get to multi-million-user territory, new doors will open: you might be funded by an angel or a VC. Think of it as a good problem to have.
My 2 cents.
I'm in the early stages of a web project which will require working with arrays containing around 500 elements of a custom object type. The objects will likely contain between 10 and 40 fields (based on user input), mostly booleans, strings and floats. I'm going to use PHP for this project, but I'm also interested in how to treat this problem in Java.
I know that "premature optimization is the root of all evil", but I think I need to decide now how to handle these arrays. Do I keep them in the session object, or do I store them in the database (MySQL) and keep just a minimal set of keys in the session? Keeping the data in the session would make the application faster, but as visitor numbers grow I risk using up too much memory. On the other hand, reading from and writing to the database all the time will degrade performance.
I'd like to know where the line between those two approaches is. How do I decide when it's too much data to keep in the session?
When I face a problem like this I try to estimate the size of the per-user data that I want to keep fast.
In your case, suppose for example you have 500 elements with 40 fields each, each field averaging 50 bytes (across texts, numbers, dates, etc.). Then you have to keep about 1 MB in memory per user for this storage, which means about 1 GB per 1000 users just for this cache.
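Spelled out: 500 elements × 40 fields × 50 bytes = 1,000,000 bytes ≈ 1 MB per user, and 1,000 users × 1 MB ≈ 1 GB.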
Depending on your server's resource availability you will find different bottlenecks: 1000 users also consume CPU, DB capacity and disk accesses. So, in this scenario, is 1 GB the problem? If yes, keep the data in the DB; if not, keep it in memory.
Another option is to use an in-memory DB or a distributed cache solution that does it all for you, at some cost:
architectural complexity
possibly licence costs
I would be surprised if you had that amount of unique data for each user. Ideally some of this data would be shared across users, so you could have an application-level cache that stores the most recently used entries and transparently fetches them from the database when they're missing.
This kind of design is relatively straightforward to implement in Java, but somewhat more involved (and possibly less efficient) in PHP, since PHP doesn't have built-in support for application state.
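A minimal Java sketch of such an application-level cache, built on LinkedHashMap's access-order eviction (the class name and loadFromDatabase are placeholders):

import java.util.LinkedHashMap;
import java.util.Map;

// a tiny LRU cache: beyond MAX_ENTRIES, the least recently used entry is
// evicted; wrap it with Collections.synchronizedMap for multi-threaded use
public class AppCache<K, V> extends LinkedHashMap<K, V> {
    private static final int MAX_ENTRIES = 10000;

    public AppCache() {
        super(16, 0.75f, true); // true = access order, so get() refreshes entries
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > MAX_ENTRIES;
    }
}

On a miss, load the value and put it back: if cache.get(key) returns null, call your own loadFromDatabase(key) and cache.put(key, value) before returning it.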