I am implementing a clustering algorithm on a large dataset. The dataset is in a text file and it contains over 100 million records. Each record contains 3 numeric fields.
1,1503895,4
3,2207774,5
6,2590061,3
...
I need to keep all of this data in memory if possible, since my clustering algorithm requires random access to the records in this file. Therefore I can't use the partition-and-merge approaches described in Find duplicates in large file
What are possible solutions to this problem? Can I use caching techniques like ehcache?
300 million ints shouldn't consume that much memory. Try instantiating an array of 300 million ints. A back-of-the-envelope calculation, at 4 bytes per int, puts that at about 1.2 GB, which is fine on a 64-bit machine.
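A minimal sketch of that idea, assuming each of the three fields fits in a Java int (the sample values shown do): three parallel arrays of 100 million ints give O(1) random access by record index.

// Sketch: hold 100 million records (3 int fields each) in three parallel arrays.
// 3 * 100_000_000 * 4 bytes is roughly 1.2 GB of heap; run with e.g. -Xmx2g.
public class InMemoryRecords {
    static final int N = 100_000_000;          // number of records (assumption)
    static final int[] field1 = new int[N];
    static final int[] field2 = new int[N];
    static final int[] field3 = new int[N];

    // Random access by record index is O(1), which is what the clustering needs.
    static int getField2(int recordIndex) {
        return field2[recordIndex];
    }

    public static void main(String[] args) {
        // Fill with dummy data just to prove the allocation works.
        for (int i = 0; i < N; i++) {
            field1[i] = i % 7;
            field2[i] = i;
            field3[i] = i % 5;
        }
        System.out.println("record 42, field 2 = " + getField2(42));
    }
}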
I have an HBase table (written through Apache Phoenix) that needs to be read and written out to a flat text file. The current bottleneck is that, because the HBase (Phoenix) table has 32 salt buckets, only 32 mappers are opened to read it, and when the data grows past 100 billion this becomes time-consuming. Can someone point me to how to control the number of mappers per region server when reading an HBase table? I have also seen the program explained at "https://gist.github.com/bbeaudreault/9788499", but it does not have a driver program that explains it fully. Can someone help?
In my observation, the number of regions of the table = the number of mappers opened by the framework,
so reducing the number of regions will in turn reduce the number of mappers.
How can this be done?
1) Pre-split the HBase table at creation time, for example into regions 0-9.
2) Load all the data within those regions by generating row key prefixes between 0-9.
There are various ways to do the splitting; one is sketched below.
Also, have a look at apache-hbase-region-splitting-and-merging
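For example, a pre-split at table creation time might look roughly like this with the HBase 2.x Java Admin API (the table name "my_table", the column family "cf" and the single-digit split points are placeholders for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {

            // Split points 1..9 give ten regions covering row key prefixes 0-9.
            byte[][] splits = new byte[9][];
            for (int i = 1; i <= 9; i++) {
                splits[i - 1] = Bytes.toBytes(String.valueOf(i));
            }

            // "my_table" and the column family "cf" are placeholders.
            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("my_table"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
                    .build(),
                splits);
        }
    }
}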
Moreover, setting the number of mappers does not guarantee that that many will actually be opened; the number is driven by the input splits.
You can change the number of mappers using setNumMapTasks or conf.set("mapred.map.tasks", "<number of mappers you want>") (but this is only a hint to the framework).
As for the link you provided, I don't know what it is or how it works; you could check with the author.
Hi, I need to read multiple tables from my databases and join them. Once the tables are joined I would like to push the result to Elasticsearch.
The tables are joined by an external process, as the data can come from multiple sources. This is not an issue; in fact I have 3 separate processes reading 3 separate tables at an average of 30,000 records per second. The records are joined into a multimap, from which a single JsonDocument is produced for each key.
Then a separate process reads the denormalized JsonDocuments and bulk-indexes them into Elasticsearch at an average of 3,000 documents per second.
I'm having trouble finding a way to split the work. I'm pretty sure my Elasticsearch cluster can handle more than 3,000 documents per second. I was thinking of somehow splitting the multimap that holds the joined JSON docs.
Anyway, I'm currently building a custom application for this, so I was wondering: are there any tools that can be put together to do all this? Some form of ETL, stream processing, or something similar?
While streaming would make records more readily available than bulk processing, and would reduce the overhead of large-object management in the Java container, you can take a hit on latency. Usually in this kind of scenario you have to find an optimum for the bulk size. For that I follow these steps:
1) Build a streaming bulk insert: stream the data, but still send more than 1 record (or, in your case, build more than 1 JSON document) at a time.
2) Experiment with several bulk sizes, for example 10, 100, 1,000 and 10,000, and plot them in a quick graph. Run a sufficient number of records to see whether performance degrades over time: it can be that a size of 10 is extremely fast per record, but that there is an incremental insert overhead (as is the case in SQL Server with primary key maintenance, for example). If you run the same total number of records for every test, the results should be representative of your performance.
3) Interpolate in your graph and maybe try out 3 values between the best values from step 2.
Then use the final result as your optimal stream bulk insertion size.
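As an illustration of step 1, a streaming bulk insert with a configurable batch size might look roughly like this with the Elasticsearch high-level REST client (a 7.x client is assumed; the index name "my_index" and the document iterator are placeholders):

import java.util.Iterator;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class StreamingBulkIndexer {

    // bulkSize is the knob to experiment with (10, 100, 1,000, 10,000, ...).
    static void indexAll(RestHighLevelClient client,
                         Iterator<String> jsonDocuments,
                         int bulkSize) throws Exception {
        BulkRequest bulk = new BulkRequest();
        while (jsonDocuments.hasNext()) {
            bulk.add(new IndexRequest("my_index")            // index name is a placeholder
                         .source(jsonDocuments.next(), XContentType.JSON));
            if (bulk.numberOfActions() >= bulkSize) {
                client.bulk(bulk, RequestOptions.DEFAULT);   // flush one batch
                bulk = new BulkRequest();
            }
        }
        if (bulk.numberOfActions() > 0) {
            client.bulk(bulk, RequestOptions.DEFAULT);       // flush the remainder
        }
    }
}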
Once you have this value, you can add one more step:
Run multiple processes in parallel. This fills the gaps in your process a bit. Watch the throughput and maybe adjust your bulk sizes one more time.
This approach once helped me speed up a multi-TB import process from 2 days to about 12 hours, so it can work out quite well.
I am new to NoSQL systems. I want to use Java + Spring + MongoDB (not that it matters).
I am trying to design the right schema for my data. I will have a huge number of log records (about 3,000,000,000 per year). The record structure looks like this:
{
shop: 'shop1',
product: 'product1',
count: '10',
incost: '100',
outcost: '120',
operation: 'sell',
date: '2015-12-12'
}
I have about 1000 shops and about 30000 products.
I need reports with the sum of count, or the sum of count * (outcost - incost), grouped by [shops] + product and split by days or months.
*[shops] means an optional filter. In that case (without shops), performance does not matter.
*Reports older than 1 year may be required, but performance does not matter there either.
Can I use a single collection "logs" with indexes on date, shop and product, or should I explicitly split it into sub-collections by shop and year?
Sorry if my question is stupid, I am just a beginner...
Regards,
Minas
Unless and until the documents grow further, this works fine. If you want to add more fields to the existing documents, or append to existing fields, and you think a document may grow beyond the 16 MB limit, then it is better to have separate collections.
The indexed keys also appear to be fine, since you have a compound index on the shop, date and product fields.
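For reference, creating such a compound index with the plain MongoDB Java driver might look like this (the connection string, database and collection names are placeholders):

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

public class CreateLogIndex {
    public static void main(String[] args) {
        MongoCollection<Document> logs = MongoClients.create("mongodb://localhost")
                .getDatabase("reports")            // database name is a placeholder
                .getCollection("logs");

        // Compound index on the fields the reports filter and group by.
        logs.createIndex(Indexes.ascending("shop", "date", "product"));
    }
}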
You would also see some performance gain (easy and fast, since only a single disk seek is needed) if the complete data is retrieved from a single collection rather than fetched from multiple collections.
I would not do much aggregation on the main collection; 3 billion records is quite a lot.
One massive problem I can see with this is that any query will likely be huge, returning a massive number of documents. It is true that you can mitigate most of the negative factors of querying this collection by using sharding to spread out the weight of the data itself; however, the sheer amount of data returned to the mongos will likely make it slow and painful.
There comes a time when no amount of indexing will save you, because your collection is just too darn big.
This would not matter if you were just displaying the collection; MongoDB could do that easily. It is aggregation that will not work well.
I would do as you suggest: pre-aggregate into other collections based on data fragments and time buckets.
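A rough sketch of that pre-aggregation idea with the MongoDB Java driver, assuming one summary document per shop/product/day bucket (the database name "reports" and collection name "daily_summary" are made up):

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class DailyPreAggregator {

    // Called once per incoming log record: bump the per-shop/product/day totals.
    static void addToDailySummary(MongoCollection<Document> daily, Document log) {
        int count = Integer.parseInt(log.getString("count"));
        int incost = Integer.parseInt(log.getString("incost"));
        int outcost = Integer.parseInt(log.getString("outcost"));

        daily.updateOne(
            Filters.and(
                Filters.eq("shop", log.getString("shop")),
                Filters.eq("product", log.getString("product")),
                Filters.eq("day", log.getString("date"))),
            Updates.combine(
                Updates.inc("totalCount", count),
                Updates.inc("totalMargin", count * (outcost - incost))),
            new UpdateOptions().upsert(true));   // create the bucket if it does not exist yet
    }

    public static void main(String[] args) {
        MongoCollection<Document> daily = MongoClients.create("mongodb://localhost")
                .getDatabase("reports")                 // placeholder database name
                .getCollection("daily_summary");        // placeholder summary collection

        addToDailySummary(daily, new Document()
                .append("shop", "shop1").append("product", "product1")
                .append("count", "10").append("incost", "100")
                .append("outcost", "120").append("operation", "sell")
                .append("date", "2015-12-12"));
    }
}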
I have an Excel sheet with a million rows. Each row has 100 columns.
Each row represents an instance of a class with 100 attributes, and the column values are the values of these attributes.
What data structure is most suitable here for storing the million instances of data?
Thanks
It really depends on how you need to access this data and what you want to optimize for: space or speed.
If you want to optimize for space, well, you could just serialize and compress the data, but that would likely be useless if you need to read/manipulate the data.
If you access by index, the simplest thing is an array of arrays.
If you instead use an array of objects, where each object holds your 100 attributes, you have a better way to structure your code (encapsulation!)
If you need to query/search the data, it really depends on the kind of queries. You may want to have a look at BST data structures...
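To make the first two options concrete, a small sketch in Java with made-up attribute names (the real class would carry ~100 fields):

public class RowLayouts {

    // Option 1: array of arrays - indexed access only.
    // 1,000,000 x 100 doubles is about 800 MB, so run with a large enough heap (e.g. -Xmx2g).
    static double[][] asArrayOfArrays(int rows, int columns) {
        return new double[rows][columns];
    }

    // Option 2: array of objects - each row becomes a typed, encapsulated instance.
    static class Record {
        String name;      // made-up attributes; the real class would have ~100 fields
        boolean active;
        double price;
    }

    public static void main(String[] args) {
        double[][] table = asArrayOfArrays(1_000_000, 100);
        table[42][7] = 3.14;                   // access by [row][column] index

        Record[] records = new Record[1_000_000];
        records[0] = new Record();
        records[0].price = 9.99;               // access through a named field
    }
}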
One million rows with 100 values each, where each value uses 8 bytes of memory, is only 800 MB, which will easily fit into the memory of most PCs, especially if they are 64-bit. Try to make the type of each column as compact as possible.
A more efficient way of storing the data is by column, i.e. one array per column with a primitive data type (sketched below). I suspect you don't even need to do this.
If you have many more rows, e.g. billions, you can use off-heap memory, i.e. memory-mapped files and direct memory. This can efficiently store more data than you have main memory while keeping your heap relatively small (e.g. hundreds of GB off-heap with 1 GB on heap).
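A sketch of the column-per-array layout described above, with made-up column names and types:

public class ColumnStore {
    // One primitive array per column instead of one object per row:
    // no per-row object header, better cache locality, and the most compact type per column.
    static final int ROWS = 1_000_000;

    static final int[]     quantity = new int[ROWS];      // 4 bytes per row
    static final short[]   category = new short[ROWS];    // 2 bytes per row
    static final double[]  price    = new double[ROWS];   // 8 bytes per row
    static final boolean[] active   = new boolean[ROWS];  // 1 byte per row

    public static void main(String[] args) {
        quantity[0] = 3;
        price[0] = 19.99;
        active[0] = true;
        System.out.println("row 0: qty=" + quantity[0] + " price=" + price[0]);
    }
}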
If you want to store all the data in memory, you can use one of the implementations of Table from Guava, typically ArrayTable for dense tables or HashBasedTable if most cells are expected to be empty. Otherwise, a database (probably with some caching layer like Ehcache or Terracotta) would be a better bet.
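For illustration, a minimal HashBasedTable sketch (row keys, attribute names and values are made up):

import com.google.common.collect.HashBasedTable;
import com.google.common.collect.Table;

public class GuavaTableExample {
    public static void main(String[] args) {
        // Row key = row number, column key = attribute name, value = cell content.
        Table<Integer, String, Object> data = HashBasedTable.create();

        data.put(0, "name", "widget");
        data.put(0, "price", 9.99);
        data.put(1, "name", "gadget");

        Object price = data.get(0, "price");   // look up a single cell
        System.out.println("row 0 price = " + price);

        // data.row(0) returns the whole row as a Map<String, Object>.
        System.out.println("row 0 = " + data.row(0));
    }
}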
Your best option would be to store them in a table in an actual database, like Postgres etc. These are optimised to work for what you are talking about!
For that kind of data I would prefer a MySQL database, because it is fast and can accommodate a large dataset like that.
The best option would be to use a database that can store a large amount of data and is fast enough for quick access, such as Oracle, MSSQL, MySQL or any other database that is fast and can handle large amounts of data.
If you really have a million rows or more with 100 values each, I doubt it will all fit into your memory... or is there a special reason for it? For example poor performance using a database?
Since you want to have random access, I'd use a persistence provider like Hibernate and a database you like (for example MySQL).
But be aware that the way you use the persistence provider has a great impact on performance. For example, you should use batch inserts (which do not work with auto-generated identity IDs).
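A common batch-insert pattern with plain Hibernate looks roughly like this; it assumes the entities are already mapped and that hibernate.jdbc.batch_size is set to a matching value in the configuration:

import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;

public class BatchInsert {

    // Persists already-mapped entity instances in JDBC batches.
    static <T> void insertInBatches(SessionFactory sessionFactory, Iterable<T> rows) {
        final int batchSize = 50;              // keep in sync with hibernate.jdbc.batch_size
        Session session = sessionFactory.openSession();
        Transaction tx = session.beginTransaction();
        int i = 0;
        for (T row : rows) {
            session.persist(row);
            if (++i % batchSize == 0) {
                session.flush();               // send the current batch to the database
                session.clear();               // detach entities so the session stays small
            }
        }
        tx.commit();
        session.close();
    }
}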
I'm in the early stages of a web project which will require working with arrays containing around 500 elements of a custom object type. The objects will likely contain between 10 and 40 fields (based on user input), mostly booleans, strings and floats. I'm going to use PHP for this project, but I'm also interested in how to treat this problem in Java.
I know that "premature optimization is the root of all evil", but I think I need to decide now how to handle these arrays. Do I keep them in the session object, or do I store them in the database (MySQL) and keep just a minimal set of keys in the session? Keeping the data in the session would make the application faster, but as visitor numbers grow I risk using up too much memory. On the other hand, reading from and writing to the database all the time will degrade performance.
I'd like to know where the line is between those two approaches. How do I decide when it's too much data to keep inside session?
When I face a problem like this I try to estimate the size of the per-user data that I want fast access to.
In your case, suppose for example that you have 500 elements with 40 fields, each field averaging 50 bytes (averaging across texts, numbers, dates, etc.). That means keeping about 1 MB per user in memory for this storage, so you will need about 1 GB for every 1,000 users just for this cache.
Depending on your server's resource availability you can find the bottlenecks: 1,000 users consume CPU, memory, DB and disk accesses; so, in this scenario, is that 1 GB the problem? If yes, keep the data in the DB; if not, keep it in memory.
Another option is to use an in-memory DB or a distributed cache solution that does it all for you, at some cost:
architectural complexity
possibly licence costs
I would be surprised if you had that amount of unique data for each user. Ideally, some of this data would be shared across users, and you could have some kind of application-level cache that stores the most recently used entries, and transparently fetches them from the database if they're missing.
This kind of design is relatively straightforward to implement in Java, but somewhat more involved (and possibly less efficient) with PHP since it doesn't have built-in support for application state.
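As a sketch of such an application-level cache in Java, a simple LRU built on LinkedHashMap could look like this (the loader function stands in for the actual database lookup):

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class LruCache<K, V> {

    private final Function<K, V> loader;   // stands in for the real database lookup
    private final Map<K, V> cache;

    public LruCache(int maxEntries, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder = true makes iteration order "least recently used first".
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;   // evict once the cache is full
            }
        };
    }

    public synchronized V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            value = loader.apply(key);        // cache miss: load from the database
            cache.put(key, value);            // put() triggers eviction when needed
        }
        return value;
    }

    public static void main(String[] args) {
        LruCache<Integer, String> cache = new LruCache<>(500, id -> "row " + id);
        System.out.println(cache.get(42));
    }
}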