Hi, I need to read multiple tables from my databases and join them. Once the tables are joined, I would like to push the result to Elasticsearch.
The tables are joined by an external process, as the data can come from multiple sources. This is not an issue; in fact, I have 3 separate processes reading 3 separate tables at an average of 30,000 records per second. The records are joined into a multimap, from which a single JsonDocument is produced for each key.
Then a separate process reads the denormalized JsonDocuments and bulk-indexes them into Elasticsearch at an average of 3,000 documents per second.
I'm having trouble finding a way to split the work. I'm pretty sure my Elasticsearch cluster can handle more than 3,000 documents per second. I was thinking of somehow splitting the multimap that holds the joined JSON docs.
Anyway, I'm building a custom application for this, so I was wondering: are there any tools that can be put together to do all of this? Some form of ETL, stream processing, or something similar?
While streaming would make records more readily available than bulk processing, and would reduce the overhead of large-object management in the Java container, you can take a hit on latency. Usually in these kinds of scenarios you have to find an optimum bulk size. I follow these steps:
1) Build a streaming bulk insert (so stream, but still send more than 1 record, or in your case build more than 1 JSON document, at a time).
2) Experiment with several bulk sizes, for example 10, 100, 1,000 and 10,000, and plot them in a quick graph. Run a sufficient number of records to see whether performance degrades over time: a bulk size of 10 may be extremely fast per record, yet there may be an incremental insert overhead (as is the case in SQL Server with primary key maintenance, for example). If you run the same total number of records for every test, the results should be representative of your performance.
3) Interpolate in your graph, and maybe try 3 values between the best values from step 2.
Then use the final result as your optimal stream bulk insertion size.
Once you have this value, you can add one more step:
Run multiple processes in parallel. This fills the gaps in your process a bit. Watch the throughput and maybe adjust your bulk size one more time.
This approach once helped me speed up a multi-TB import process from 2 days to about 12 hours, so it can work out quite well.
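For the streaming bulk insert itself, Elasticsearch's BulkProcessor already handles the batching and the parallel requests for you. Below is a minimal sketch, assuming the high-level REST client; client, jsonDoc, bulkSize and "my-index" are placeholders for your own client, document source, tuned bulk size and index name.
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.common.xcontent.XContentType;
// client is an already-built RestHighLevelClient
BulkProcessor processor = BulkProcessor.builder(
        (request, bulkListener) -> client.bulkAsync(request, RequestOptions.DEFAULT, bulkListener),
        new BulkProcessor.Listener() {
            @Override public void beforeBulk(long id, BulkRequest request) { }
            @Override public void afterBulk(long id, BulkRequest request, BulkResponse response) { }
            @Override public void afterBulk(long id, BulkRequest request, Throwable failure) { }
        })
        .setBulkActions(bulkSize)        // the bulk size you settled on in steps 2 and 3
        .setConcurrentRequests(2)        // number of bulk requests in flight in parallel
        .build();
// feed it one denormalized JsonDocument at a time; it flushes every bulkSize documents
processor.add(new IndexRequest("my-index").source(jsonDoc, XContentType.JSON));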
I have some data which I have to ingest into Solr every day; the data per day is around 10-12 GB, and I also have to run a catch-up job for the last year, where each day is again around 10-12 GB of data.
I am using Java, and I need to score my data by doing a partial update if the same unique key arrives again; I used docValues with a TextField:
https://github.com/grossws/solr-dvtf
Initially, I used a sequential approach, which took a lot of time (reading from S3 and adding to Solr in batches of 60k).
I found this repo:
https://github.com/lucidworks/spark-solr,
but I couldn't understand the implementation, and since I needed to modify the field data for some scoring logic, I wrote custom Spark code.
Then I created 4 Solr nodes (on the same IP) and used Spark to insert the data. Initially, the partitions created by Spark far outnumbered the Solr nodes, and the number of 'executors' specified was also greater than the number of nodes, so it took much more time.
Then I repartitioned the RDD into 4 (the number of Solr nodes) and specified 4 executors; the insertion took less time and was successful. But when I ran the same job for a month of data, one or more Solr nodes kept going down. I have enough free space on disk, and my RAM usage rarely fills up.
Please suggest a way to solve this problem (I have an 8-core CPU).
Or should I use a different machine for each Solr node?
Thanks!
I am not sure Spark would be the best way to load that much data into Solr.
Your possible options for loading data into Solr are:
Through the hbase-indexer (also called the batch indexer), which syncs data between your HBase table and your Solr index.
You can also implement the hbase-lily-indexer, which is almost real time.
You can also use Solr's JDBC utility - THE BEST in my opinion. What you can do is read the data from S3 and load it into a Hive table through Spark. Then you can point a Solr JDBC-based import at your Hive table, and trust me, it is very fast.
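For that last option, here is a rough sketch of the S3-to-Hive step using Spark's Java API; the bucket path and table name are placeholders for your own.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
SparkSession spark = SparkSession.builder()
        .appName("s3-to-hive")
        .enableHiveSupport()                                  // needed to write managed Hive tables
        .getOrCreate();
Dataset<Row> raw = spark.read().json("s3a://my-bucket/solr-feed/2019-01-01/"); // placeholder path
raw.write().mode(SaveMode.Append).saveAsTable("staging.solr_feed");            // placeholder table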
Let me know if you want more information on any of these.
I'm using MongoDB as a cache right now. The application is fed 3 CSVs overnight, and the CSVs keep getting bigger because new products are added all the time. Right now I've reached 5 million records, and it took about 2 hours to process everything. As the cache is refreshed every day, it is becoming impractical to refresh the data.
For example:
CSV 1
ID, NAME
1, NAME!
CSV 2
ID, DESCRIPTION
1, DESC
CSV 3
ID, SOMETHING_ELSE
1, SOMETHING_ELSE
The application will read CSV 1 and put it in the database. Then CSV 2 will be read; if there is new information, it will be added to the same document, otherwise a new record is created. The same logic applies to CSV 3. So one document will get different attributes from different CSVs, hence the upsert. After everything is done, all the documents will be indexed.
Right now the first 1 million documents are relatively quick, but I can see the performance degrading considerably over time. I'm guessing it's because of the upsert, as MongoDB has to find the document and update its attributes, or otherwise create it. I'm using the Java driver and MongoDB 2.4. Is there any way I could improve this, or even do batch upserts with the MongoDB Java driver?
What do you mean by 'after everything is done then all the documents will be indexed'?
If it is because you want to add additional indexes, whether to do that at the end is debatable, but it is fine.
If you have absolutely no indexes, then this is likely your issue.
You want to ensure that all the inserts/upserts you are doing use an index. You can run one of your queries with .explain() to see whether an index is being used appropriately.
You need an index, otherwise you are scanning 1 million documents for each insert/update.
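For example, from the legacy Java driver the question mentions; the collection and field names are placeholders for whatever key you upsert on, and db is an existing com.mongodb.DB instance.
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
DBCollection col = db.getCollection("products");               // db: your com.mongodb.DB instance
col.createIndex(new BasicDBObject("productId", 1));            // index the key you upsert on
DBObject plan = col.find(new BasicDBObject("productId", "1")).explain();
System.out.println(plan);                                      // on 2.4, look for BtreeCursor rather than BasicCursor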
Also, can you give more details about your application?
are you going to do the import in 3 phases only once, or will you do frequent updates?
do CSV2 and CSV3 modify a large percentage of the documents?
do the modifications of CSV2 and CSV3 add or replace documents?
what is the average size of your documents?
Let's assume you are doing a lot of updates on the same documents, many times; for example, CSV2 and CSV3 both have updates for the same documents. Instead of importing CSV1, then applying updates for CSV2, then another set of updates for CSV3, you may want to simply keep the documents in your application's memory, apply all the updates there, and then push the finished documents to the database. That assumes you have enough RAM for the operation; otherwise you will be hitting the disk again.
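Once the merged documents are ready, pushing them through the bulk write API takes most of the per-document round-trip cost out of the upserts. A minimal sketch, assuming the 3.x+ Java driver rather than the 2.x one you are on, with placeholder connection string, collection and field names, and csvRows standing in for your parsed CSV lines:
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOneModel;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.WriteModel;
import org.bson.Document;
import java.util.ArrayList;
import java.util.List;
MongoCollection<Document> products = MongoClients.create("mongodb://localhost:27017")
        .getDatabase("cache").getCollection("products");
List<WriteModel<Document>> batch = new ArrayList<>();
for (String[] row : csvRows) {                                 // csvRows: the parsed lines of one CSV
    batch.add(new UpdateOneModel<>(
            Filters.eq("_id", row[0]),                         // the shared ID column from the CSVs
            new Document("$set", new Document("description", row[1])),
            new UpdateOptions().upsert(true)));
    if (batch.size() == 1000) {                                // send 1000 upserts per round trip
        products.bulkWrite(batch);
        batch.clear();
    }
}
if (!batch.isEmpty()) products.bulkWrite(batch);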
Just FYI, this question is not exactly about MongoDB; it just happens to use MongoDB. I am assuming we might end up using a MongoDB feature such as sharding in a good design, hence the mention of MongoDB. Also, FWIW, we use Java.
So we have around 100 million records in a certain collection, from which we need to select all the items that have some date set to tomorrow. Usually this query returns 10 million records.
You can assume that we have N (say ten) machines at hand, and that MongoDB is sharded on record_id.
Each record that we process is independent of the other records we are reading. No records will be written as part of this batch job.
What I am looking for is:
Workload distribution across the different machines that does not rely on a central coordinator.
Fair, or almost fair, workload distribution.
(Not sure if the following requirement can be fulfilled without compromising requirement 1.)
Fault tolerance (if one of the batch machines is down, we want another machine to take over its load).
Is there any good solution that has already worked in a similar situation?
I can speak in the context of MongoDB.
Requirements 1 and 2 are handled through sharding. I'm not sure I follow your question, though, as it sounds like requirement 1 says you don't want to centralize the workload and requirement 2 says you want to distribute the workload evenly.
In any case, with the proper shard key, you will distribute your workload across your shards. http://docs.mongodb.org/manual/sharding/
Requirement 3 is handled via replica sets in MongoDB. http://docs.mongodb.org/manual/replication/
I would have to understand your application and use case better to know for certain, but pulling 10M records out of a 100M-record collection as your typical access pattern doesn't sound like the right document model is in place. Keep in mind that collection <> table and document <> record. I would look into storing your 10M records at a higher logical granularity so that you pull fewer records; this will significantly improve performance.
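As for splitting the batch read itself without a central coordinator, one simple approach is to have each of the N machines run the same job with its own slot index and filter on record_id modulo N, so every machine independently knows its slice. A minimal sketch, assuming record_id is numeric; the connection string, field names and process() are placeholders.
import com.mongodb.client.FindIterable;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;
int n = 10;                                                    // total number of batch machines
int workerIndex = Integer.parseInt(args[0]);                   // this machine's slot, 0..n-1
MongoCollection<Document> records = MongoClients.create("mongodb://mongos:27017")
        .getDatabase("app").getCollection("records");
FindIterable<Document> slice = records.find(Filters.and(
        Filters.eq("due_date", tomorrow),                      // placeholder for "some date set to tomorrow"
        Filters.mod("record_id", n, workerIndex)));            // record_id % n == workerIndex
for (Document doc : slice) {
    process(doc);                                              // your independent per-record work
}
If one machine dies, its slice is well defined by its remainder, so another machine can rerun just that slice, which gives a crude version of the fault tolerance in requirement 3.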
I am a Java developer. I want to know the best way to store huge amounts of data into MySQL using Java.
Huge: two hundred thousand talk messages every second.
An index is not needed here.
Should I store the messages in the database as soon as the user creates them? Will that be too slow?
1 billion writes per day is about 12k per second. Assuming each message is about 16 bytes, that's about 200 KB/sec. If you don't care about reading the data back, you can easily write this to disk at that rate, maybe one message per line. Your read access pattern is probably going to dictate what you end up needing to do here.
If you use MySQL, I'd suggest combining multiple messages per row, if possible. Partitioning the table would be helpful to keep the working set in memory, and you'll want to commit a number of records per transaction, maybe 1000 rows. You'll need to do some testing and tuning, and this page will be helpful:
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
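A minimal sketch of those batched commits; the messages table, its columns, and the Message holder class are placeholders, and the rewriteBatchedStatements flag lets Connector/J collapse the batch into multi-row inserts.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
try (Connection con = DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/chat?rewriteBatchedStatements=true", "user", "pass")) {
    con.setAutoCommit(false);
    try (PreparedStatement ps = con.prepareStatement(
            "INSERT INTO messages (session_id, body) VALUES (?, ?)")) {
        int count = 0;
        for (Message m : incoming) {                  // incoming: your queued messages
            ps.setLong(1, m.sessionId);
            ps.setString(2, m.body);
            ps.addBatch();
            if (++count % 1000 == 0) {                // ~1000 rows per transaction, as suggested above
                ps.executeBatch();
                con.commit();
            }
        }
        ps.executeBatch();                            // flush the remainder
        con.commit();
    }
}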
You should probably also look at Cassandra, which is written with heavy write workloads in mind.
My suggestion is also MongoDB, since the NoSQL paradigm fits your needs perfectly.
Below is a flavor of the MongoDB Java driver:
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;
MongoClient mongo = new MongoClient("localhost", 27017);  // assumes a local mongod
DB db = mongo.getDB("mkyongDB");
DBCollection collection = db.getCollection("hosting");
BasicDBObject document = new BasicDBObject();
document.put("database", "mkyongDB");
document.put("table", "hosting");
BasicDBObject documentDetail = new BasicDBObject();
documentDetail.put("records", "99");
documentDetail.put("index", "vps_index1");
documentDetail.put("active", "true");
document.put("detail", documentDetail);  // embed the detail as a sub-document
collection.insert(document);
This tutorial is good for getting started. You can download MongoDB from GitHub.
For MongoDB optimization, please refer to this post.
Do you absolutely have to use MySQL, or are you open to other DBs as well? MongoDB or CouchDB would be a good fit for this kind of need. Check them out if you are open to other DB options.
If you absolutely have to go with MySQL, then we have done something similar: all the related text messages go into a child record as a single JSON blob. We append to it every time, and we keep the master in a separate table. So there is one master and one child record at a minimum, and more child records once the messages go beyond a certain number (30 in our scenario); a kind of 'load more...' query then fetches a second child record, which holds the next 30.
Hope this helps.
FYI, we are migrating to CouchDB for some other reasons and needs.
There are at least 2 different parts to this problem:
Processing the messages for storage in the database
What type of storage to use for the message
For processing the messages, you're likely going to need a horizontally scalable system (meaning you can add more machines to process the messages quickly) so you don't accumulate a huge backlog. You should definitely not try to write these messages synchronously; rather, when a message is received, put it on a queue to be processed and written to the database (something like JMS comes to mind here).
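For the queue step, a minimal JMS producer sketch, assuming ActiveMQ as the broker, a placeholder queue name, and messageJson standing in for the serialized chat message; a separate consumer drains the queue and does the actual database writes.
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;
ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
Connection connection = factory.createConnection();
connection.start();
Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
MessageProducer producer = session.createProducer(session.createQueue("chat.messages"));
producer.send(session.createTextMessage(messageJson));   // messageJson: the serialized chat message
connection.close();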
In terms of data storage, MySQL is a relational database, but it doesn't sound like you are really doing any relational data processing, rather just storing a large amount of data. I would suggest looking into a NoSQL database (as others have suggested here as well) such as MongoDB, Cassandra, CouchDB, etc. They each have their strengths and weaknesses (you can read more about each of them on their respective websites and elsewhere on the internet).
I guess typical access would involve retrieving at least all the text of one chat session.
The number of rows is large and your data is not very relational. This is a good fit for a non-relational database.
If you still want to go with MySQL, use partitions. While writing, use batch inserts, and while reading, provide sufficient partition-pruning hints in your queries. Use EXPLAIN PARTITIONS to check whether partitions are being pruned. In this case I would strongly recommend combining the chat lines of one chat session into a single row; this will dramatically reduce the number of rows compared to one chat line per row.
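A quick way to verify the pruning from Java; the table and column names are placeholders, assuming a table partitioned by day and an open MySQL connection con.
import java.sql.ResultSet;
import java.sql.Statement;
try (Statement st = con.createStatement();                      // con: an open MySQL connection
     ResultSet rs = st.executeQuery(
         "EXPLAIN PARTITIONS SELECT * FROM chat_lines WHERE chat_day = '2014-01-01'")) {
    while (rs.next()) {
        System.out.println(rs.getString("partitions"));          // should list only the matching partition
    }
}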
You didn't mention how many days of data you want to store.
On a separate note: how successful would your app have to be, in terms of users, to require 200k messages per second? An active chat session may generate about 1 message every 5 seconds from one user. For ease of calculation, let's make it 1 message per second. So you are building capacity for 200K concurrently active users, which implies at least a few million users in total.
It is good to think of scale early. However, it requires engineering effort. And since resources are limited, allocate them carefully for each task (Performance/UX etc). Spending more time on UX, for example, may yield a better ROI. When you get to multi-million user territory, new doors will open. You might be funded by an Angel or VC. Think of it as a good problem to have.
My 2 cents.
I have a big list of over 20,000 items to be fetched from the DB and processed daily in a simple console-based Java app.
What is the best way to do that? Should I fetch the list in small sets and process them, or should I fetch the complete list into an array and process it? Keeping it all in an array means a large memory requirement.
Note: There is only one column to process.
Processing means I have to pass the string in that column somewhere else as a SOAP request.
The 20,000 items are strings of length 15.
It depends. 20,000 is not really a big number. If you are only processing 20,000 short strings or numbers, the memory requirement isn't that large. But if it's 20,000 images, that is a bit larger.
There's always a tradeoff: multiple chunks of data mean multiple trips to the database, but a single trip means more memory. Which is more important to you? Also, can your data be chunked, or do you need, for example, record 1 in order to process record 1000?
These are all things to consider. Hopefully they help you arrive at the design that is best for you.
Correct me if I am wrong: fetch it little by little, and also provide a rollback operation for it.
If the job can be done at the database level, I would do it using SQL scripts. Should this be impossible, I recommend loading small pieces of your data with two columns: the ID column and the column that needs to be processed.
This will give you better performance during the process, and if anything crashes you will not lose all the processed data. In a crash scenario, however, you will need to know which datasets have been processed and which have not; this can be done with a third column or by saving the last processed ID each round.
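Putting the last two suggestions together, a rough keyset-paging loop; the table and column names, the chunk size, and the helpers sendSoapRequest, loadLastProcessedId and saveLastProcessedId are placeholders for your own code.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
long lastId = loadLastProcessedId();                           // e.g. read from a checkpoint file or table
try (Connection con = DriverManager.getConnection(url, user, pass);
     PreparedStatement ps = con.prepareStatement(
         "SELECT id, payload FROM items WHERE id > ? ORDER BY id LIMIT 500")) {
    while (true) {
        ps.setLong(1, lastId);
        int rows = 0;
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                sendSoapRequest(rs.getString("payload"));      // the 15-character string from the question
                lastId = rs.getLong("id");
                rows++;
            }
        }
        if (rows == 0) break;                                  // nothing left to process
        saveLastProcessedId(lastId);                           // checkpoint so a crash can resume here
    }
}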