NoSQL data split - java

I am new to NoSQL systems. I want to use Java + Spring + MongoDB (the exact stack is not important).
I am trying to design a correct schema for my data. I will have a very large number of log records (roughly 3,000,000,000 per year). The record structure looks like this:
{
  shop: 'shop1',
  product: 'product1',
  count: '10',
  incost: '100',
  outcost: '120',
  operation: 'sell',
  date: '2015-12-12'
}
I have about 1000 shops and about 30000 products.
I need reports with the sum of count, or the sum of (count * (outcost - incost)), grouped by [shop] + product and split by days or months.
*[shop] is an optional filter. In that case (without the shop filter) performance does not matter.
*Reports older than 1 year may be required, but their performance does not matter either.
Can I use a single collection "logs" with indexes on date, shop and product? Or should I split this collection into sub-collections by shop and year explicitly?
Sorry if my question is stupid, I am just a beginner...
Regards,
Minas

Unless and until the documents grow further, this works fine. If you want to add more fields to the existing document, or append to the existing fields, and you think a document may grow beyond the 16 MB limit, then it's better to have separate collections.
The indexing also looks fine, as you would have a compound index on the shop, date and product fields.
You would also get some performance gain (easy and fast, since only a single disk seek is needed) if the complete data is retrieved from a single collection rather than fetched from multiple collections.
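For illustration, such a compound index can be created from the Java driver roughly as follows (the database, collection and field names follow the question; this is only a sketch, not a tuned recommendation):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

public class CreateLogIndex {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> logs =
                client.getDatabase("reports").getCollection("logs");
            // Compound index supporting equality filters on shop/product
            // plus range queries on date.
            logs.createIndex(Indexes.ascending("shop", "product", "date"));
        }
    }
}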

I would not do much aggregation on the main collection, 3 billion records is quite a lot.
One massive problem I can see with this is that any query will likely be huge, returning a massive number of documents. It is true that you can mitigate most of the negative factors of querying this collection by using sharding to spread out the weight of the data itself; however, the sheer amount of data returned to the mongos will likely make queries slow and painful.
There comes a point when no amount of indexing will save you, because your collection is just too darn big.
This would not matter if you were just displaying the collection; MongoDB could do that easily. It is aggregation that will not work well.
I would do as you suggest: pre-aggregate into other collections based on data fragments and time buckets.
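As a rough sketch of what such a pre-aggregation could look like with the Java driver (the daily_sales collection name is my invention, and the pipeline assumes count, incost and outcost are stored as numbers rather than strings), you could periodically roll the raw logs up into a per-shop/product/day summary and run the reports against that instead:

import static java.util.Arrays.asList;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;
import org.bson.Document;

public class DailyRollup {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> logs =
                client.getDatabase("reports").getCollection("logs");
            // Group the raw log records by shop, product and day, summing the
            // quantities and margins, then write the result into a summary
            // collection ($out replaces daily_sales on every run).
            logs.aggregate(asList(
                Aggregates.group(
                    new Document("shop", "$shop")
                        .append("product", "$product")
                        .append("day", "$date"),
                    Accumulators.sum("totalCount", "$count"),
                    Accumulators.sum("margin",
                        new Document("$multiply", asList("$count",
                            new Document("$subtract", asList("$outcost", "$incost")))))),
                Aggregates.out("daily_sales")
            )).toCollection();
        }
    }
}

Monthly figures can then be produced by a second, much smaller aggregation over daily_sales.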

Related

Sort an ArrayList of an object using Cache

Any suggestion regarding the problem below would be appreciated.
Present situation:
I have an ArrayList of objects. We have already implemented sorting using a comparator. Each object has hundreds of fields, so the size of a single object in the ArrayList is not small. Going forward, when the size of the ArrayList increases, we feel this will create a problem in sorting because of the overall size of the ArrayList.
Plan:
We will load the objects into a cache.
Instead of taking an ArrayList of objects as input, we plan to take an ArrayList of ids (strings) as input, and when an id is being compared, get the corresponding object from the cache.
Problem:
I don't want to load all the objects into the cache, because the cache will be used only during the sorting, and I don't want to create a huge cache just for this.
What I was planning to do was load only half of the objects into the cache; if something is not present in the cache, load it from the DB, use it, and put it into the cache (replacing one of the objects already there). I don't want to query the DB for a single object at a time, because that way I would be hitting the DB tens of thousands of times.
I want to do a bulk read from the DB, but I was not able to come up with a strategy for that.
Any suggestion will be appreciated.
You're very confused.
Each object has hundreds of fields.
Irrelevant. Java uses references; that 'ArrayList of objects' you have is backed by an array, and each slot in that array takes about 8 bytes (depending on underlying VM details, it could be 4 bytes too). The slots represent, more or less, the location in memory where each object lives.
Going forward, when the size of the ArrayList increases
.... no, that won't be a problem. If you put 100,000 entries in this list, the total memory load, at least for the list itself, is at most 800,000 bytes, which is less than a megabyte. Put it this way: on modern hardware, that list alone could contain 100 million items and your system wouldn't break a sweat (that would be less than a GB of memory for the references). Now, if you also have 100 million unique objects (versus, say, adding the exact same object 100 million times, or adding null 100 million times), those objects ALSO occupy memory. That could be a problem. But the list is not the relevant part.
we feel this will create a problem in sorting because of the overall size of the ArrayList.
No. When you sort an ArrayList, you get roughly n log n operations. The actual sorting infrastructure (moving entries around in the list) is near enough to zero cost: it's just blitting those 4 to 8 byte references around, usually within a single memory page. Assuming the invocation of .compare() is cheap, even a throwaway $100 computer can sort millions of entries in fractions of a second. That just leaves the roughly n log n invocations of .compare(). If that is expensive, okay, you may have a problem. So, in a list of 1 million entries, you're looking at roughly 13 million invocations of your compare method.
How fast is it?
If calling .compare(a, b) (where a and b are references to instances of your 'hundreds of fields' objects) inspects every single one of those hundreds of fields, that could get a little tricky, perhaps, but if it just checks a few of them, there's nothing to worry about here. CPUs are FAST. You may go: "MILLIONS? Oh my gosh!", but your CPU laughs at this job.
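To make that concrete, a comparator that only touches a couple of fields is cheap. Something along these lines (the Order class and its fields are invented for illustration):

import java.time.LocalDate;
import java.util.Comparator;
import java.util.List;

public class OrderSort {
    // Hypothetical entity with "hundreds of fields"; only the two compared ones are shown.
    static class Order {
        LocalDate createdOn;
        String customerName;
        // ... many more fields that the comparator never reads
    }

    // Each of the ~n log n compare calls is just two cheap field reads;
    // the other fields never come into play.
    static final Comparator<Order> BY_DATE_THEN_NAME =
        Comparator.comparing((Order o) -> o.createdOn)
                  .thenComparing(o -> o.customerName);

    static void sortInPlace(List<Order> orders) {
        orders.sort(BY_DATE_THEN_NAME);   // millions of entries finish in well under a second
    }
}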
We will load the objects into a cache.
This plan is bad, for the reasons above.
I want to do a bulk read from the DB
Okay, so when you started out with 'we have an ArrayList of objects', you actually don't have that, and you have a DB connection instead? Which one is it?
Either you have all your data in an ArrayList, or you have your data in a DB. If it's all in an ArrayList, the DB part is irrelevant. If you don't have all your data in an ArrayList, your question is misleading and isn't clear.
If the data is in a DB, set up proper indices and use the ORDER BY clause.

SQL query performance, archive vs status change

Straight to the point: I've tried searching on Google and on SO but can't find what I'm looking for. It could be because I'm not wording my searches correctly.
My question is:
I have a couple of tables which will be holding anywhere from 1,000 to 100,000 rows per year. I'm trying to figure out whether, and how, I should handle archiving the data. I'm not very experienced with databases, but below are a few methods I've come up with, and I'm unsure which is better practice, taking into account performance and ease of coding. I'm using Java 1.8, Sql2o and Postgres.
Method 1
Archive the data into a separate database every year.
I don't really like this method, because when we want to search for old data our application will need to look in a different database, and it'll be a hassle for me to add more code for this.
Method 2
Archive data older than 2-3 years into a separate database,
and use a status on the rows to improve performance (see Method 3). This is what I'm leaning towards as an 'optimal' solution, where the code is not very complex and it also keeps my DB relatively clean.
Method 3
Just have a status for each row (e.g. A=active, R=archived) to possibly improve query performance, i.e. "select * from table where status = 'A'" to reduce the number of rows to look through.
100,000 rows per year is not that much. [1]
There's no need to move that to a separate place. If you already have good indexes in place, you almost certainly won't notice any degraded performance over the years.
However, if you want to be absolutely sure, you could add a year column and create an index for that (or add that to your existing indexes). But really, do that only for the tables where you know you need it. For example, if your table already has a date column which is part of your index(es), you don't need a separate year column.
[1] Unless you have thousands of columns and/or columns that contain large binary blobs - which doesn't seem to be the case here.
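As a small illustration of that suggestion, adding an index on an existing date column is a one-off DDL statement; run from Java it might look like this (the connection settings, table and column names are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class AddDateIndex {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost:5432/mydb", "user", "password");
             Statement stmt = conn.createStatement()) {
            // Index the date column so year/date range filters stay fast as the table grows.
            stmt.execute("CREATE INDEX IF NOT EXISTS idx_orders_created_on ON orders (created_on)");
        }
    }
}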
As Vog mentions, 100,000 rows is not very many. Nor is 1,000,000 or 5,000,000 -- sizes that your tables may grow to.
In many databases, you could use a clustered index where the first key is the "active" column. However, Postgres does not really support clustered indexes.
Instead, I would suggest that you look into table partitioning. This is a method where the underlying storage is split among different "files". You can easily restrict a query to one or more partitions by using the partitioning key in a WHERE clause.
For your particular use case, I would further suggest having a view over only the active data. Queries through it would read only one partition, so performance should be pretty much the same as reading a table that holds only the most recent data.
That said, I'm not sure if it is better to partition by an active flag or by year. That depends on how you are accessing the data, particularly the older data.
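A rough sketch of what that could look like (this assumes PostgreSQL 10+ declarative partitioning; the table, column and partition names are illustrative, and as noted above you might partition by year instead of by status):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreatePartitionedTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost:5432/mydb", "user", "password");
             Statement stmt = conn.createStatement()) {
            // Parent table partitioned by the archive-status flag.
            stmt.execute("CREATE TABLE orders (" +
                         "  id bigint, " +
                         "  created_on date NOT NULL, " +
                         "  status char(1) NOT NULL, " +      // 'A' = active, 'R' = archived
                         "  payload text" +
                         ") PARTITION BY LIST (status)");
            stmt.execute("CREATE TABLE orders_active PARTITION OF orders FOR VALUES IN ('A')");
            stmt.execute("CREATE TABLE orders_archived PARTITION OF orders FOR VALUES IN ('R')");
            // Day-to-day queries go through this view and touch only the active partition.
            stmt.execute("CREATE VIEW active_orders AS SELECT * FROM orders WHERE status = 'A'");
        }
    }
}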

MongoDB related scaling issue

Just FYI, this question is not exactly about MongoDB, it just happens to use MongoDB. I am assuming we might end up using MongoDB features such as sharding in a good design, hence mentioning MongoDB. Also, FWIW, we use Java.
We have around 100 million records in a certain collection, of which we need to select all the items which have some date set to tomorrow. Usually this query returns around 10 million records.
You can assume that we have N (say ten) machines at our disposal, and that MongoDB is sharded on record_id.
Each record that we process is independent of the other records we read. No records will be written as part of this batch job.
What I am looking for is:
1. No centralized workload distribution across the different machines.
2. Fair or almost fair workload distribution.
3. Fault tolerance (if one of the batch machines is down, we want another machine to take its load). (Not sure if this requirement can be fulfilled without compromising requirement 1.)
Any good solution which has already worked in a similar situation?
I can speak in the context of MongoDB.
Requirements 1 and 2 are handled through sharding. I'm not sure I follow your question, though, as it sounds like 1 says you don't want to centralize workload distribution and 2 says you want the workload distributed evenly.
In any case, with a proper shard key you will distribute your workload across your shards: http://docs.mongodb.org/manual/sharding/
Requirement 3 is handled via replica sets in MongoDB: http://docs.mongodb.org/manual/replication/
I would have to understand your application and use case better to know for certain, but pulling 10M records out of a 100M-record collection as your typical access pattern doesn't sound like the right document model is in place. Keep in mind that collection <> table and document <> record. I would look into storing your 10M records at a higher logical granularity so you pull fewer records; this will significantly improve performance.
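One way to split such a batch without a central coordinator, as a sketch (it assumes record_id is numeric, the due-date field is indexed, and all names here are invented): give each of the N workers a fixed remainder of record_id modulo N, so every machine reads its own disjoint slice of tomorrow's records.

import static com.mongodb.client.model.Filters.and;
import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Filters.mod;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import org.bson.Document;
import java.time.LocalDate;

public class BatchWorker {
    public static void main(String[] args) {
        int totalWorkers = Integer.parseInt(args[0]);   // N machines
        int workerIndex  = Integer.parseInt(args[1]);   // 0 .. N-1, unique per machine

        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> records =
                client.getDatabase("mydb").getCollection("records");

            String tomorrow = LocalDate.now().plusDays(1).toString();

            // Each worker reads only the records whose record_id falls into its
            // modulo bucket, so the ~10M matching records are split roughly evenly
            // across the machines with no coordinator involved.
            try (MongoCursor<Document> cursor = records.find(and(
                        eq("due_date", tomorrow),
                        mod("record_id", totalWorkers, workerIndex))).iterator()) {
                while (cursor.hasNext()) {
                    process(cursor.next());
                }
            }
        }
    }

    private static void process(Document doc) {
        // placeholder for the per-record work
    }
}

Fault tolerance (requirement 3) still needs something on top of this, e.g. detecting a dead worker and having another machine rerun its bucket; the sketch does not cover that.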

Processing large amount of data in java

I am implementing a clustering algorithm on a large dataset. The dataset is in a text file and it contains over 100 million records. Each record contains 3 numeric fields.
1,1503895,4
3,2207774,5
6,2590061,3
...
I need to keep all this data in memory if possible, since my clustering algorithm requires random access to the records in this file. Therefore I can't use the partition-and-merge approaches described in Find duplicates in large file.
What are the possible solutions to this problem? Can I use caching techniques like Ehcache?
300 million ints shouldn't consume that much memory. Try instantiating an array of 300 million ints. A back-of-the-envelope calculation, for a 64-bit machine, gives about 1.2 GB.
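A minimal sketch of that approach: read the file into three parallel int arrays (one per field), which keeps all the values in roughly 1.2-1.4 GB and still gives O(1) random access by record index. The file name, the comma delimiter and the 120 million upper bound are assumptions based on the sample lines:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LoadRecords {
    public static void main(String[] args) throws IOException {
        int maxRecords = 120_000_000;       // headroom above the expected 100M+ records
        int[] f1 = new int[maxRecords];     // 3 arrays x 120M ints x 4 bytes ~= 1.4 GB
        int[] f2 = new int[maxRecords];
        int[] f3 = new int[maxRecords];

        int n = 0;
        try (BufferedReader in = Files.newBufferedReader(Paths.get("records.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split(",");   // sample rows are comma-separated
                f1[n] = Integer.parseInt(parts[0]);
                f2[n] = Integer.parseInt(parts[1]);
                f3[n] = Integer.parseInt(parts[2]);
                n++;
            }
        }
        // Random access to record i is now simply f1[i], f2[i], f3[i].
        System.out.println("Loaded " + n + " records");
    }
}

Run with a heap large enough for the arrays, e.g. -Xmx2g.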

Best way to sort the data : DB Query or in Application Code

I have a MySQL table with some data (over a million rows). I have a requirement to sort the data based on the criteria below:
1) Newest
2) Oldest
3) Top rated
4) Least rated
What is the recommended way to develop the sort functionality?
1) For every sort request, execute a DB query with the required joins and ORDER BY conditions, and return the sorted data.
2) Get all the data (unsorted) from the table, put it in a cache, and write custom comparators (Java) to sort the data.
I am leaning towards #2, as the load on the DB happens only once. Moreover, application code is better than a DB query.
Please share your thoughts....
Thanks,
Karthik
Do as much in the database as you can. Note that if you have 1,000,000 rows, returning all million is nearly useless. Are you going to display this on a web site? I think not. Do you really care about the 500,000th least popular post? Again, I think not.
So do the sorts in the database and return the top 100, 500, or 1000 rows.
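For instance, here is a sketch of the 'newest' case done entirely in the database (the table and column names are placeholders; the other three sorts just change the ORDER BY column or direction):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class TopNewest {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost:3306/mydb", "user", "password");
             // Sort and truncate in the database; only the rows you will display come back.
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT id, title, created_at, rating FROM posts " +
                 "ORDER BY created_at DESC LIMIT ?")) {
            ps.setInt(1, 100);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + " " + rs.getString("title"));
                }
            }
        }
    }
}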
It's much faster to do it in the database:
1) the database is optimized for I/O operations, and can use indices, and other DB optimizations to improve the response time
2) taking the data from the database into the application means loading all of it into memory; the app will then have to look at all the data to reorder it, without optimized algorithms
3) the database only takes the minimum necessary data into memory, which can be much less than all the data that would have to be moved into Java
4) you can always create extra indices on the database to improve the query performance.
I would say that the operation on the DB will always be faster. You should ensure that caching on the DB is ON and working properly. Ensure that you are not using NOW() in your query, because it disables the MySQL query cache. Take a look at how the MySQL query cache works: in short, a query is cached based on its string, so if the query string differs on every fetch, the cache is not used.
AFAIK it should usually run faster if you let the DB sort your data.
Regarding application-level code vs DB-level code, I would agree in the case of stored procedures, but sorting in SELECTs is fine IMHO.
If you want to show the data to the user, also consider paging (in which case you're better off sorting at the DB level anyway).
Fetching a million rows from the database sounds like a terrible idea. It will generate a lot of network traffic and require quite some time to transfer all the data, not to mention the amount of memory you would need to allocate in your application for storing a million objects.
So if you can fetch only a subset with a query, do that. Overall, do as much filtering as you can in the database.
And I do not see any problem with ordering in a single query. You can always use UNION if you can't do it as one SELECT.
You do not have four tasks, you have two:
sort newest IS EQUAL TO sort oldest
AND
sort top rated IS EQUAL TO sort least rated.
So you only need two calls to the DB. Yes, sort in the DB. Then, instead of calling the DB to sort every time, do this (a sketch follows the list):
1] track the timestamp of the latest record in the DB
2] before calling the DB to sort and retrieve the entire list, check whether that timestamp has changed
3] if it has not changed, use the list you already have in memory
4] if it has changed, refresh the list
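A minimal sketch of that idea, combined with reusing one sorted list for both directions (the Post class, the posts table and the updated_at column are invented for illustration):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortedPostCache {
    private Timestamp lastSeen;                           // newest updated_at we have loaded
    private List<Post> newestFirst = new ArrayList<>();

    // Returns the cached list when nothing has changed, otherwise reloads from the DB.
    public List<Post> newestFirst(Connection conn) throws Exception {
        Timestamp latest = latestTimestamp(conn);
        if (lastSeen == null || (latest != null && latest.after(lastSeen))) {
            newestFirst = loadNewestFirst(conn);          // ORDER BY done in the database
            lastSeen = latest;
        }
        return newestFirst;
    }

    // "Oldest" is the same list walked the other way; no second query needed.
    public List<Post> oldestFirst(Connection conn) throws Exception {
        List<Post> reversed = new ArrayList<>(newestFirst(conn));
        Collections.reverse(reversed);
        return reversed;
    }

    private Timestamp latestTimestamp(Connection conn) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement("SELECT MAX(updated_at) FROM posts");
             ResultSet rs = ps.executeQuery()) {
            return rs.next() ? rs.getTimestamp(1) : null;
        }
    }

    private List<Post> loadNewestFirst(Connection conn) throws Exception {
        List<Post> result = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(
                 "SELECT id, title, updated_at FROM posts ORDER BY updated_at DESC");
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                result.add(new Post(rs.getLong("id"), rs.getString("title")));
            }
        }
        return result;
    }

    public static class Post {
        final long id;
        final String title;
        Post(long id, String title) { this.id = id; this.title = title; }
    }
}

The same pattern applies to the rating-based sort: keep one 'top rated' list and reverse it for 'least rated'.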
I know this is an old thread, but it comes up in my search, so I'd like to post my opinion.
I'm a bit old school, but for that many rows I would consider dumping the data from your database (each RDBMS has its own method; for MySQL it looks like the mysqldump command).
You can then process this with sorting algorithms or tools that are available in your java libraries or operating system.
Be careful about the work you're asking your database to do. Remember that it has to stay available to service other requests. Don't "bring it to its knees" servicing only one request, unless it's a nightly batch-cycle type of scenario and you're certain it won't be asked to do anything else.
