MongoDB related scaling issue - java

Just FYI, this question is not strictly about MongoDB, but it happens to use MongoDB. I am assuming a good design might end up using a MongoDB feature such as sharding, hence the mention. Also, FWIW, we use Java.
We have around 100 million records in a certain collection, from which we need to select all the items that have some date set to tomorrow. This query usually returns about 10 million records.
Assume we have N (say ten) machines at hand, and that MongoDB is sharded on record_id.
Each record we process is independent of the other records we read. No records are written as part of this batch job.
What I am looking for is:
1) No centralized workload distribution across the machines.
2) Fair, or almost fair, workload distribution.
3) Fault tolerance: if one of the batch machines goes down, the other machines should take over its load. (I am not sure whether this can be fulfilled without compromising requirement 1.)
Is there a good solution that has already worked in a similar situation?

I can speak in the context of MongoDB.
Requirements 1 and 2 are handled by sharding. I'm not sure I follow your question, though: it sounds like requirement 1 says you don't want to centralize the workload, and requirement 2 says you want to distribute the workload evenly.
In any case, with the proper shard key, you will distribute your workload across your shards. http://docs.mongodb.org/manual/sharding/
Requirement 3 is performed via replica sets in MongoDB. http://docs.mongodb.org/manual/replication/
I would have to understand your application and use case better to know for certain, but pulling 10M records out of a 100M-record collection as your typical access pattern suggests that the right document model is not in place. Keep in mind that a collection is not a table and a document is not a record. I would look into storing your 10M records at a higher logical granularity so that you pull fewer documents; this will significantly improve performance.
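For the batch-worker side of the question, one possible way to get decentralized, roughly even, failure-tolerant assignment is rendezvous (highest-random-weight) hashing. The sketch below is not from the answer above: the class and method names are hypothetical, and it assumes every worker can somehow obtain the same list of live workers (for example from a heartbeat collection). Each worker decides on its own whether a record is its responsibility; when a worker disappears from the list, only its records move to the survivors.

```java
import java.util.List;

/** Decides, without a coordinator, which worker owns a record (rendezvous hashing). */
final class WorkAssigner {

    /** 64-bit bit mixer (MurmurHash3 finalizer constants) so scores look random. */
    static long mix(long z) {
        z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
        z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
        return z ^ (z >>> 33);
    }

    /** Score of a (worker, record) pair; the live worker with the highest score owns the record. */
    static long score(String workerId, String recordId) {
        return mix(workerId.hashCode() * 0x9E3779B97F4A7C15L ^ recordId.hashCode());
    }

    /** Every worker calling this with the same live list gets the same answer. */
    static String ownerOf(String recordId, List<String> liveWorkers) {
        if (liveWorkers.isEmpty()) {
            throw new IllegalArgumentException("no live workers");
        }
        String best = null;
        long bestScore = 0;
        for (String w : liveWorkers) {
            long s = score(w, recordId);
            // Ties are broken by worker name so the result stays deterministic.
            if (best == null || s > bestScore || (s == bestScore && w.compareTo(best) < 0)) {
                best = w;
                bestScore = s;
            }
        }
        return best;
    }

    /** A worker processes a record only if it is the owner. */
    static boolean isMine(String self, String recordId, List<String> liveWorkers) {
        return self.equals(ownerOf(recordId, liveWorkers));
    }
}
```

A worker would run the "date is tomorrow" query and skip every record for which `isMine` returns false. How the live-worker list is maintained is left open here, and that part is where the tension with requirement 1 shows up.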

Related

SQL query performance, archive vs status change

Straight to the point: I've tried searching on Google and on SO but can't find what I'm looking for. It could be because I'm not wording my searches correctly.
My question is:
I have a couple of tables that will each receive anywhere between 1,000 and 100,000 rows per year. I'm trying to figure out whether, and how, I should archive the data. I'm not very experienced with databases, but below are a few methods I've come up with, and I'm unsure which is better practice, taking into account performance and ease of coding. I'm using Java 1.8, Sql2o and Postgres.
Method 1
Archive the data into a separate database every year.
I don't really like this method because when we want to search for old data, our application will need to search into a different database and it'll be a hassle for me to add in more code for this.
Method 2
Archive the data into a separate database for data older than 2-3 years.
And use a status on the rows to improve performance (see method 3). This is what I'm leaning towards as the 'optimal' solution: the code is not too complex, and it also keeps my DB relatively clean.
Method 3
Just have a status for each row (e.g. A=Active, R=Archived) to possibly improve query performance, using something like "select * from table where status = 'A'" to reduce the number of rows to look through.
100,000 rows per year is not that much. [1]
There's no need to move that to a separate place. If you already have good indexes in place, you almost certainly won't notice any degraded performance over the years.
However, if you want to be absolutely sure, you could add a year column and create an index for that (or add that to your existing indexes). But really, do that only for the tables where you know you need it. For example, if your table already has a date column which is part of your index(es), you don't need a separate year column.
[1] Unless you have thousands of columns and/or columns that contain large binary blobs, which doesn't seem to be the case here.
As Vog mentions, 100,000 rows is not very many. Nor is 1,000,000 or 5,000,000 -- sizes that your tables may grow to.
In many databases, you could use a clustered index where the first key is the "active" column. However, Postgres does not really support clustered indexes.
Instead, I would suggest that you look into table partitioning. This is a method where the underlying storage is split among different "files". You can easily specify that a query reads one or more partitions by using the partitioning key in a where clause.
For your particular use-case, I would further suggest having views on the data only for the active data. This would only read one partition, so the performance should be pretty much the same as reading a table with only the most recent data.
That said, I'm not sure if it is better to partition by an active flag or by year. That depends on how you are accessing the data, particularly the older data.

Trying bulk/ingest "large" amount of documents SQL Db to Elasticsearch

Hi, I need to read multiple tables from my databases and join them. Once the tables are joined, I would like to push the result to Elasticsearch.
The tables are joined by an external process, as the data can come from multiple sources. This is not an issue; in fact, I have 3 separate processes reading 3 separate tables at an average of 30,000 records per second. The records are joined into a multimap, from which a single JsonDocument is produced for each key.
A separate process then reads the denormalized JsonDocuments and bulk-loads them into Elasticsearch at an average of 3,000 documents per second.
I'm having trouble finding a way to split the work. I'm pretty sure my Elasticsearch cluster can handle more than 3,000 documents per second. I was thinking of somehow splitting the multimap that holds the joined JSON docs.
Anyway, I'm building a custom application for this, so I was wondering: are there any tools that can be put together to do all this? Some form of ETL, stream processing, or something else?
While streaming would make records available sooner than bulk processing, and would reduce the overhead of managing large objects in the Java container, you can take a hit on latency. In these kinds of scenarios you usually have to find the optimal bulk size. I follow these steps:
1) Build a streaming bulk insert: stream, but still send more than one record at a time (or, in your case, build more than one JSON document at a time).
2) Experiment with several bulk sizes (10, 100, 1000 and 10000, for example) and plot them in a quick graph. Run enough records to see whether performance degrades over time: a size of 10 may be extremely fast per record, yet there may be an incremental insert overhead (for example, primary key maintenance in SQL Server). If you run the same total number of records for every test, the results should be representative of your performance.
3) Interpolate in your graph, and maybe try three values between the best values from step 2.
Then use the final result as your optimal streaming bulk insert size.
Once you have this value, you can add one more step:
Run multiple processes in parallel. This fills the gaps in your process a bit. Watch the throughput and maybe adjust your bulk size one more time.
This approach once helped me speed up a multi-TB import process from 2 days to about 12 hours, so it can work out quite well.
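The bulk-size experiment in step 2 could be sketched roughly as follows. This is only an illustration: `BulkSizeTrial` and its methods are made-up names, and the `sink` stands in for whatever actually sends one bulk request (an Elasticsearch bulk call in the question's case).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class BulkSizeTrial {

    /** Splits records into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> batches(List<T> records, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < records.size(); i += batchSize) {
            out.add(records.subList(i, Math.min(i + batchSize, records.size())));
        }
        return out;
    }

    /** Sends all records through the sink batch by batch; returns records per second. */
    static <T> double throughput(List<T> records, int batchSize, Consumer<List<T>> sink) {
        long start = System.nanoTime();
        for (List<T> batch : batches(records, batchSize)) {
            sink.accept(batch); // hypothetical: one bulk request per batch
        }
        long elapsedNanos = Math.max(1, System.nanoTime() - start);
        return records.size() * 1e9 / elapsedNanos;
    }
}
```

Calling `throughput` with the same record list for each of the sizes 10, 100, 1000 and 10000 gives the points for the graph; the sink and the record list are the only parts that depend on the real system.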

Performance Optimization in Java

In Java code I am trying to fetch 3500 rows from a DB (Oracle). It takes almost 15 seconds to load the data. I have also tried storing the result in a cache and retrieving it from there. I am using a simple SELECT statement and displaying 8 columns from a single table (no joins). I use a List to hold the data from the DB and use it as the source for a Datatable. I have also considered the hardware side, such as RAM capacity, storage and network speed; it exceeds the minimum requirements comfortably. Can you help me make this quicker? (It shouldn't take more than 3 seconds.)
Have you implemented proper indexing on your tables? I don't like to ask, since this is a very basic way of optimizing tables for queries and you mention that you have already tried several things. One workaround that works for me, if the purpose of the query is to display the results, is to design the code so that it displays the initial data immediately while it is still loading more. This implies one thread for loading and a separate thread for displaying.
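A minimal sketch of that load-while-displaying idea, using a queue between a loader thread and the displaying thread. Everything here is an assumption for illustration: `IncrementalLoader` is a made-up name, the `pages` iterable stands in for chunks of rows read from the database, and `display` stands in for whatever updates the table. In a real UI the display step would also have to run on the toolkit's UI thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

final class IncrementalLoader {
    /** End-of-data marker; compared by identity, never shown. */
    private static final List<String> END = new ArrayList<>();

    /** Loads pages on a background thread and displays each page as soon as it arrives. */
    static void loadAndDisplay(Iterable<List<String>> pages, Consumer<List<String>> display) {
        BlockingQueue<List<String>> queue = new LinkedBlockingQueue<>();
        Thread loader = new Thread(() -> {
            try {
                for (List<String> page : pages) {
                    queue.add(page); // hypothetical: one chunk of rows from the DB
                }
            } finally {
                queue.add(END); // always signal completion, even if loading fails
            }
        });
        loader.start();
        try {
            for (List<String> page = queue.take(); page != END; page = queue.take()) {
                display.accept(page); // the first rows appear before loading finishes
            }
            loader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for rows", e);
        }
    }
}
```

This only improves perceived latency; the total load time stays the same.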
It is most likely that the core problem is that you have one or more of the following:
a poorly designed schema,
a poorly designed query,
a badly overloaded database, and / or
a badly overloaded / underprovisioned network connection between the database and your client.
No amount of changing the client side (Java) code is likely to make a significant difference (i.e. a 5-fold increase) ... unless you are doing something crazy in the way you are building the list, or the bottleneck is in the display code not the retrieval.
You need to use some client-side and server-side performance tools to figure out whether the real bottleneck is the client, the server or the network. Then use those results to decide where to focus your attention.

CouchDB data replication

I have 30 GB of Twitter data stored in CouchDB. I am aiming to process each tweet in Java, but the Java program is not able to hold that much data at one time. In order to process the entire dataset, I am planning to divide it into smaller ones with the help of the filtered replication supported by CouchDB. But, as I am new to CouchDB, I am facing a lot of problems in doing so. Any better ideas are welcome. Thanks.
You can always query CouchDB for a dataset that is small enough for your Java program, so there should be no reason to replicate subsets to smaller databases. See this Stack Overflow answer for a way to get paged results from CouchDB. You might even employ CouchDB itself for the processing with map/reduce, but that depends on your problem.
Depending on the complexity of the queries and the changes you make when processing your data set you should be fine with one instance.
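The paging idea can be sketched without any CouchDB client. The common pattern is to read one row more than the page size and use that extra row's key as the start key of the next request. In the sketch below a sorted in-memory map stands in for a CouchDB view, so `KeyPager` and `forEachPage` are hypothetical names, not part of any CouchDB API; against a real view the `tailMap` call would become a request with a start key and a limit of `pageSize + 1`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.function.Consumer;

final class KeyPager {

    /**
     * Visits every value in key order, looking at no more than pageSize + 1
     * entries per simulated request. Returns the number of requests made.
     */
    static <V> int forEachPage(NavigableMap<String, V> view, int pageSize,
                               Consumer<List<V>> process) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        int requests = 0;
        String startKey = view.isEmpty() ? null : view.firstKey();
        while (startKey != null) {
            requests++;
            List<V> page = new ArrayList<>();
            String nextStart = null;
            for (Map.Entry<String, V> e : view.tailMap(startKey, true).entrySet()) {
                if (page.size() == pageSize) {
                    nextStart = e.getKey(); // the extra row marks where the next page begins
                    break;
                }
                page.add(e.getValue());
            }
            process.accept(page); // only one page is held in memory at a time
            startKey = nextStart;
        }
        return requests;
    }
}
```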
As the previous poster said, you can use paged results. I tend to do something different:
I have a document for social likes. Each one always refers to a URL, and I want to try to update it every 2-3 hours.
I have a view that sorts the documents for each URL by the age of the last update request and of the last update.
I query this view so that I exclude the articles that had a request within the last 30 minutes or were updated less than 2 hours ago.
I use RabbitMQ to enqueue the jobs, and if they are not picked up within 30 minutes, they expire.

best way to store huge data into mysql using java

I am a Java developer. I want to know what is the best way to store huge data into mysql using Java.
Huge: two hundred thousand talk messages every second.
An index is not needed here
Should I store the messages into the database as soon as the user creates them? Will it be too slow?
1 billion writes / day is about 12k writes / second. Assuming each message is about 16 bytes, that's about 200 KB / sec. If you don't care about reading, you can easily write to disk at this rate, maybe one message per line. Your read access pattern is probably going to dictate what you end up needing to do here.
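The arithmetic behind those figures, as a small worked example. The inputs (1 billion messages per day, 16 bytes per message) are the answer's own assumptions; the class name is made up.

```java
final class WriteRate {
    static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    /** Average messages per second for a given daily volume. */
    static double messagesPerSecond(long messagesPerDay) {
        return (double) messagesPerDay / SECONDS_PER_DAY;
    }

    /** Average payload bytes per second, ignoring any storage overhead. */
    static double bytesPerSecond(long messagesPerDay, int bytesPerMessage) {
        return messagesPerSecond(messagesPerDay) * bytesPerMessage;
    }
}
```

With those inputs, `messagesPerSecond(1_000_000_000L)` is about 11,574 and `bytesPerSecond(1_000_000_000L, 16)` is about 185,185, which matches the rounded 12k and 200k above. These are averages; peak traffic would be higher.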
If you use MySQL, I'd suggest combining multiple messages per row, if possible. Partitioning the table would be helpful to keep the working set in memory, and you'll want to commit a number of records per transaction, maybe 1000 rows. You'll need to do some testing and tuning, and this page will be helpful:
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
You should probably also look at Cassandra which is written with heavy write workloads in mind.
My suggestion is also MongoDB, since the NoSQL paradigm fits your needs perfectly.
Below is a flavor of MongoDB in Java:
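// Uses the legacy MongoDB Java driver (com.mongodb.BasicDBObject).
// `collection` is assumed to be an already opened com.mongodb.DBCollection.
BasicDBObject document = new BasicDBObject();
document.put("database", "mkyongDB");
document.put("table", "hosting");
// Nested sub-document, stored under the "detail" key.
BasicDBObject documentDetail = new BasicDBObject();
documentDetail.put("records", "99");
documentDetail.put("index", "vps_index1");
documentDetail.put("active", "true");
document.put("detail", documentDetail);
// Write the document to the collection.
collection.insert(document);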
This tutorial is a good way to get started. You can download MongoDB from GitHub.
For optimization of MongoDB, please refer to this post.
Do you absolutely have to use MySQL, or are you open to other DBs as well? MongoDB or CouchDB would be a good fit for these kinds of needs. Check them out if you are open to other DB options.
If you absolutely have to go with MySQL: we have done something similar, where all the related text messages go into a child record as a single JSON value. We append to it every time, and we keep the master in a separate table. So there is one master and at least one child record, with more child records added as the messages go beyond a certain number (30 in our scenario). We implemented a kind of "load more..." that queries the second child record, which holds the next 30.
Hope this helps.
FYI, we are migrating to CouchDB for some other reasons and needs.
There are at least 2 different parts to this problem:
Processing the messages for storage in the database
What type of storage to use for the message
For processing the messages, you're likely going to need a horizontally scalable system (meaning you can add more machines to process the messages quickly) so you don't accumulate a huge backlog of messages. You should definitely not try to write these messages synchronously, but rather when a message is received, put it on a queue to be processed for writing to the database (something like JMS comes to mind here).
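A rough sketch of the "queue now, write later" idea. It uses an in-process queue as a stand-in for a real message broker such as JMS, so it shows only the shape of the approach: `QueuedWriter` is a hypothetical name and the `store` callback stands for one batched database write.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

final class QueuedWriter {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final int maxBatch;
    private final Consumer<List<String>> store; // hypothetical: one batched write + commit

    QueuedWriter(int maxBatch, Consumer<List<String>> store) {
        if (maxBatch < 1) {
            throw new IllegalArgumentException("maxBatch must be at least 1");
        }
        this.maxBatch = maxBatch;
        this.store = store;
    }

    /** Called when a message is received; returns without touching the database. */
    void submit(String message) {
        queue.add(message);
    }

    /** Waits up to timeoutMillis for a message, then writes up to maxBatch messages at once. */
    int writeNextBatch(long timeoutMillis) {
        List<String> batch = new ArrayList<>();
        try {
            String first = queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            if (first == null) {
                return 0; // nothing arrived in time
            }
            batch.add(first);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return 0;
        }
        queue.drainTo(batch, maxBatch - 1); // take whatever else is already waiting
        store.accept(batch);
        return batch.size();
    }
}
```

One or more writer threads would call `writeNextBatch` in a loop. Unlike a durable broker, this in-memory queue loses pending messages if the process dies.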
In terms of data storage, MySQL is a relational database, but it doesn't sound like you are really doing any relational data processing, rather just storing a large amount of data. I would suggest looking into a NoSQL database (as others have suggested here as well) such as MongoDB, Cassandra, CouchDB, etc. They each have their strengths and weaknesses (you can read more about each of them on their respective websites and elsewhere on the internet).
I guess, typical access would involve retrieving all text of one chat session at least.
The number of rows is large and your data is not very relational. This is a good fit for a non-relational database.
If you still want to go with MySQL, use partitions. When writing, use batch inserts, and when reading, provide sufficient partition pruning hints in your queries. Use EXPLAIN PARTITIONS to check whether partitions are being pruned. In this case I would strongly recommend that you combine the chat lines of one chat session into a single row. This will dramatically reduce the number of rows compared to one chat line per row.
You didn't mention how many days of data you want to store.
On a separate note: how successful would your app have to be, in terms of users, to require 200k messages per second? An active chat session may generate about 1 message every 5 seconds from one user. For ease of calculation, let's make it 1 second. So you are building capacity for 200K online users, which implies you would have at least a few million users.
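Combining the chat lines of one session into a single row could look roughly like this before the write. The `ChatLine` shape and the newline separator are assumptions made for the example; a real schema might store JSON instead.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class SessionRows {

    /** One incoming chat line; a hypothetical shape for illustration. */
    static final class ChatLine {
        final String sessionId;
        final String text;

        ChatLine(String sessionId, String text) {
            this.sessionId = sessionId;
            this.text = text;
        }
    }

    /** Returns one entry per session: session id mapped to its lines, joined in arrival order. */
    static Map<String, String> toRows(List<ChatLine> lines) {
        return lines.stream().collect(Collectors.groupingBy(
                line -> line.sessionId,
                LinkedHashMap::new, // keeps sessions in first-seen order
                Collectors.mapping(line -> line.text, Collectors.joining("\n"))));
    }
}
```

Each map entry then becomes one row to insert or append to, instead of one row per chat line.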
It is good to think of scale early. However, it requires engineering effort. And since resources are limited, allocate them carefully for each task (Performance/UX etc). Spending more time on UX, for example, may yield a better ROI. When you get to multi-million user territory, new doors will open. You might be funded by an Angel or VC. Think of it as a good problem to have.
My 2 cents.
