Best way to insert a lot of data into Solr - Java

I have to ingest data into Solr every day, around 10-12 GB per day, and I also have to run a catch-up job for the last year, where each day is again around 10-12 GB.
I am using Java, and I need to score my data by doing a partial (atomic) update whenever the same unique key arrives again; for that I used docValues with a TextField:
https://github.com/grossws/solr-dvtf
Initially, I used a sequential approach, which took a lot of time (reading from S3 and adding to Solr in batches of 60k).
I found this repo:
https://github.com/lucidworks/spark-solr,
but I couldn't work out how to adapt its implementation, since I need to modify field data for some scoring logic, so I wrote custom Spark code instead.
Then I created 4 Solr nodes (on the same IP) and used Spark to insert the data. At first the number of partitions created by Spark was far higher than the number of Solr nodes, and the number of executors I specified was also higher than the number of nodes, so insertion took much more time.
Then I repartitioned the RDD into 4 partitions (the number of Solr nodes) and specified 4 executors; insertion took less time and succeeded. But when I ran the same job for a month of data, one or more Solr nodes kept going down, even though I have enough free disk space and my RAM usage only rarely fills up.
Please suggest a way to solve this problem (I have an 8-core CPU). Or should I run the Solr nodes on separate machines?
Thanks!
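For reference, a minimal sketch of the per-partition indexing described above might look like the following, assuming SolrJ's CloudSolrClient, a numeric score field that accepts atomic "inc" updates, and a placeholder record layout; the batch size is illustrative only.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SolrPartitionWriter {
    // index one Spark partition's records as atomic updates, incrementing a
    // numeric "score" field whenever the same unique key arrives again
    public static void indexPartition(Iterable<String[]> records, String zkHost, String collection)
            throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList(zkHost), Optional.empty()).build()) {
            List<SolrInputDocument> batch = new ArrayList<>();
            for (String[] rec : records) {              // rec[0] = unique key, rec[1] = score delta
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", rec[0]);
                doc.addField("score", Collections.singletonMap("inc", Double.parseDouble(rec[1])));
                batch.add(doc);
                if (batch.size() == 10000) {            // keep batches well below the 60k used before
                    client.add(collection, batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                client.add(collection, batch);
            }
            client.commit(collection);
        }
    }
}

Each Spark partition would call something like this from foreachPartition, with one client per partition.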

I am not sure Spark is the best way to load that much data into Solr.
Your possible options for loading data into Solr are:
Through the HBase indexer (also called the batch indexer), which syncs data between your HBase table and the Solr index.
You can also implement the HBase Lily indexer, which is near real time.
You can also use Solr's JDBC utility, which is the best option in my opinion. Read the data from S3 and load it into a Hive table through Spark, then hook Solr up to that Hive table via JDBC; it is very fast. (A rough sketch of the Spark-to-Hive step follows below.)
Let me know if you want more information on any of these.
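As a rough illustration of the Spark-to-Hive half of the third option, something like the following could work; the S3 path, format, and table name are placeholders, and this is only a sketch of the loading step, not the Solr side.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class S3ToHive {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("s3-to-hive")
                .enableHiveSupport()          // needed so saveAsTable writes to a Hive table
                .getOrCreate();

        // read one day's raw data from S3 (path and format are placeholders)
        Dataset<Row> raw = spark.read().json("s3a://my-bucket/ingest/2019-01-01/");

        // any scoring/field transformations would go here before the write

        raw.write().mode(SaveMode.Append).saveAsTable("staging.solr_ingest");
        spark.stop();
    }
}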

Related

Trying to bulk/ingest a "large" amount of documents from a SQL DB into Elasticsearch

Hi, I need to read multiple tables from my databases and join them. Once the tables are joined, I would like to push the results to Elasticsearch.
The tables are joined by an external process, as the data can come from multiple sources. This is not an issue; in fact, I have 3 separate processes reading 3 separate tables at an average of 30,000 records per second. The records are joined into a multimap, from which a single JsonDocument is produced for each key.
Then a separate process reads the denormalized JsonDocuments and bulk-indexes them into Elasticsearch at an average of 3,000 documents per second.
I'm having trouble finding a way to split the work. I'm pretty sure my Elasticsearch cluster can handle more than 3,000 documents per second, and I was thinking of somehow splitting the multimap that holds the joined JSON docs.
Anyway, I'm building a custom application for this, so I was wondering: are there any tools that can be put together to do all this? Some form of ETL, stream processing, or something similar?
While streaming would make records more readily available than bulk processing, and would reduce the overhead of large-object management in the Java container, it can take a hit on latency. Usually in this kind of scenario you have to find the optimum bulk size. For this I follow these steps:
1) Build a streaming bulk insert (i.e., stream, but still gather more than one record, or in your case build more than one JSON document, at a time).
2) Experiment with several bulk sizes, for example 10, 100, 1,000, and 10,000, and plot them in a quick graph. Run a sufficient number of records to see whether performance degrades over time: it may be that a bulk size of 10 is extremely fast per record but that there is an incremental insert overhead (as is the case in SQL Server with primary key maintenance). If you run the same total number of records for every test, the results should be representative of your performance.
3) Interpolate in your graph and maybe try 3 more values between the best values from step 2.
Then use the final result as your optimal streaming bulk insertion size.
Once you have this value, you can add one more step:
Run multiple processes in parallel. This fills the gaps in your process a bit. Watch the throughput and perhaps adjust your bulk size one more time.
This approach once helped me speed up a multi-TB import process from 2 days to about 12 hours, so it can work out quite well.
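To make the bulk-size experiment concrete, a sketch using the Elasticsearch high-level REST client (7.x-style API assumed) might look like the following; the host, index name, and document source are placeholders, and bulkSize is the value you would vary per run.

import java.util.List;
import org.apache.http.HttpHost;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class BulkSizeTest {
    // index the given JSON documents in batches of bulkSize and time the run
    public static void index(List<String> jsonDocs, int bulkSize) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            long start = System.currentTimeMillis();
            BulkRequest bulk = new BulkRequest();
            for (String json : jsonDocs) {
                bulk.add(new IndexRequest("docs").source(json, XContentType.JSON));
                if (bulk.numberOfActions() >= bulkSize) {
                    client.bulk(bulk, RequestOptions.DEFAULT);
                    bulk = new BulkRequest();
                }
            }
            if (bulk.numberOfActions() > 0) {
                client.bulk(bulk, RequestOptions.DEFAULT);   // flush the remainder
            }
            System.out.println("bulkSize=" + bulkSize + " took "
                    + (System.currentTimeMillis() - start) + " ms");
        }
    }
}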

CouchDB data replication

I have 30 GB of Twitter data stored in CouchDB. I am aiming to process each tweet in Java, but the Java program cannot hold that much data at once. In order to process the entire dataset, I am planning to divide it into smaller subsets using the filtered replication supported by CouchDB. But as I am new to CouchDB, I am facing a lot of problems doing so. Any better ideas are welcome. Thanks.
You can always query CouchDB for a dataset that is small enough for your Java program, so there should be no reason to replicate subsets into smaller databases. See this Stack Overflow answer for a way to get paged results from CouchDB. You might even use CouchDB itself for the processing via map/reduce, but that depends on your problem.
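A minimal sketch of that paging approach over plain HTTP, assuming a database named tweets, a page size of 1,000, and Jackson for JSON parsing (all of these are placeholders), might look like this:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CouchPager {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        ObjectMapper mapper = new ObjectMapper();
        String base = "http://localhost:5984/tweets/_all_docs?include_docs=true&limit=1000";
        String startKeyDocId = null;

        while (true) {
            // after the first page, resume from the last doc id and skip that row
            // (doc ids may need URL-encoding)
            String url = startKeyDocId == null
                    ? base
                    : base + "&startkey_docid=" + startKeyDocId + "&skip=1";
            HttpResponse<String> resp = http.send(
                    HttpRequest.newBuilder(URI.create(url)).build(),
                    HttpResponse.BodyHandlers.ofString());
            JsonNode rows = mapper.readTree(resp.body()).get("rows");
            if (rows == null || rows.size() == 0) break;

            for (JsonNode row : rows) {
                processTweet(row.get("doc"));   // hypothetical per-tweet processing
            }
            startKeyDocId = rows.get(rows.size() - 1).get("id").asText();
        }
    }

    static void processTweet(JsonNode doc) { /* application-specific work */ }
}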
Depending on the complexity of the queries and the changes you make while processing your dataset, you should be fine with one instance.
As the previous poster said, you can use paged results; I tend to do something different:
I have a document type for social likes. Each document refers to a URL, and I want it to be updated every 2-3 hours.
I have a view that sorts the documents' URLs by the age of the last update request and the last update.
I query this view so that I exclude the articles that had a request within the last 30 minutes or have been updated less than 2 hours ago.
I use RabbitMQ to enqueue the jobs, and if they are not picked up within 30 minutes, they expire.

MongoDB related scaling issue

Just FYI, this question is not strictly about MongoDB, but it happens to use MongoDB. I am assuming a good design might end up using MongoDB features such as sharding, hence the mention. Also, FWIW, we use Java.
We have around 100 million records in a certain collection, from which we need to select all the items that have some date set to tomorrow. Usually this query returns about 10 million records.
Assume we have N (say ten) machines at hand, and that MongoDB is sharded on record_id.
Each record we process is independent of the other records we read. No records will be written as part of this batch job.
What I am looking for is:
No centralized workload distribution across the machines.
Fair, or almost fair, workload distribution.
(Not sure if the following requirement can be fulfilled without compromising requirement 1.)
Fault tolerance (if one of the batch machines goes down, we want another machine to take over its load).
Is there any good solution that has already worked in a similar situation?
I can speak in the context of MongoDB.
Requirements 1 and 2 are addressed through sharding. I'm not sure I follow your question, though, as it sounds like requirement 1 says you don't want to centralize the workload and requirement 2 says you want to distribute the workload evenly.
In any case, with the proper shard key, you will distribute your workload across your shards. http://docs.mongodb.org/manual/sharding/
Requirement 3 is handled via replica sets in MongoDB. http://docs.mongodb.org/manual/replication/
I would have to understand your application and use case better to know for certain, but pulling 10M records out of 100M as your typical access pattern suggests the right document model may not be in place. Keep in mind that collection <> table and document <> record. I would look into storing your records at a higher logical granularity so you pull fewer documents; this will significantly improve performance.
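One possible (non-centralized) way to split the read-only batch across N machines is for each worker to filter on the shard key with $mod, so every machine reads a disjoint slice. This is only a sketch: it assumes record_id is numeric, and the host, database, collection, and date-field names are placeholders.

import java.util.Arrays;
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;

public class BatchWorker {
    public static void main(String[] args) {
        int totalWorkers = Integer.parseInt(args[0]); // N machines in the batch pool
        int workerIndex  = Integer.parseInt(args[1]); // this machine's index, 0..N-1

        MongoClient mongo = new MongoClient("mongos-host", 27017);          // placeholder host
        DBCollection coll = mongo.getDB("appdb").getCollection("records");  // placeholder names

        // items scheduled for tomorrow AND whose record_id falls in this worker's slice
        BasicDBObject query = new BasicDBObject("scheduled_for", "2014-06-02") // placeholder date value
                .append("record_id", new BasicDBObject("$mod", Arrays.asList(totalWorkers, workerIndex)));

        DBCursor cursor = coll.find(query);
        while (cursor.hasNext()) {
            process(cursor.next()); // each record is independent, so no coordination is needed
        }
        mongo.close();
    }

    private static void process(DBObject record) {
        // application-specific work goes here
    }
}

Fault tolerance for the workers themselves would still need something extra, for example another worker re-claiming a failed machine's slice, which this sketch does not cover.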

Best way to have fast-access key-value storage for a huge dataset (5 GB)

There is a dataset of ~5 GB in size. This dataset has just one key-value pair per line.
The values need to be looked up by key some billions of times.
I have already tried the disk-based approach of MapDB, but it throws ConcurrentModificationException and isn't mature enough to be used in a production environment yet.
I also don't want to put it in a DB and make that call billions of times (though a certain level of in-memory caching could be done here).
Basically, I need to access this key-value dataset in the mapper/reducer of a Hadoop job step.
After trying out a bunch of things, we are now using SQLite.
Following is what we did:
We load all the key-value pair data into a pre-defined database file (indexed on the key column; this increased the file size but was worth it).
Store this file (key-value.db) in S3.
The file is then passed to the Hadoop jobs via the distributed cache.
In the setup/configure method of the Mapper/Reducer, a connection is opened to the db file (it takes around 50 ms).
In the map/reduce method, we query this db with the key (the lookup time was so negligible we didn't even need to profile it!).
The connection is closed in the cleanup method of the Mapper/Reducer.
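Roughly, the Mapper side of this might look like the sketch below, assuming the sqlite-jdbc driver is on the classpath, the cached file is available in the task's working directory as key-value.db, and the table looks like kv(k TEXT PRIMARY KEY, v TEXT); all of these names are placeholders.

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<Object, Text, Text, Text> {
    private Connection conn;
    private PreparedStatement lookup;

    @Override
    protected void setup(Context context) throws IOException {
        try {
            // the db file was shipped via the distributed cache
            conn = DriverManager.getConnection("jdbc:sqlite:key-value.db");
            lookup = conn.prepareStatement("SELECT v FROM kv WHERE k = ?");
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            lookup.setString(1, value.toString());
            try (ResultSet rs = lookup.executeQuery()) {
                if (rs.next()) {
                    context.write(value, new Text(rs.getString(1)));
                }
            }
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        try { lookup.close(); conn.close(); } catch (SQLException e) { throw new IOException(e); }
    }
}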
Try Redis. It seems this is exactly what you need.
I would try Oracle Berkeley DB Java Edition. It supports Maps and is both mature and scalable.
I noticed you tagged this with elastic-map-reduce... if you're running on AWS, maybe DynamoDB would be appropriate.
Also, I'd like to clarify: is this dataset going to be the input to your MapReduce job, or is it a supplementary dataset that will be accessed randomly during the MapReduce job?

Best way to store huge data into MySQL using Java

I am a Java developer. I want to know the best way to store huge amounts of data into MySQL using Java.
Huge: two hundred thousand talk messages every second.
An index is not needed here
Should I store the messages in the database as soon as the user creates them? Would that be too slow?
1 billion writes per day is about 12k per second. Assuming each message is about 16 bytes, that's about 200 KB per second of data. If you don't care about reading, you can easily write this to disk at that rate, perhaps one message per line. Your read access pattern is probably going to dictate what you end up needing to do here.
If you use MySQL, I'd suggest combining multiple messages per row, if possible. Partitioning the table would be helpful to keep the working set in memory, and you'll want to commit a number of records per transaction, maybe 1000 rows. You'll need to do some testing and tuning, and this page will be helpful:
http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html
You should probably also look at Cassandra, which is designed with heavy write workloads in mind.
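For the MySQL route, a batched insert that commits roughly every 1,000 rows might look like this sketch; the connection URL, table, and column are placeholders, and rewriteBatchedStatements is a Connector/J option that can noticeably speed up batched inserts.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class MessageWriter {
    // flush a buffer of messages in batches of 1000 rows per transaction
    public static void write(List<String> messages) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/chat?rewriteBatchedStatements=true", "user", "pass")) {
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO messages (body) VALUES (?)")) {
                int count = 0;
                for (String msg : messages) {
                    ps.setString(1, msg);
                    ps.addBatch();
                    if (++count % 1000 == 0) {   // commit every 1000 rows
                        ps.executeBatch();
                        conn.commit();
                    }
                }
                ps.executeBatch();               // flush the remainder
                conn.commit();
            }
        }
    }
}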
My suggestion is also MongoDB, since the NoSQL paradigm fits your needs perfectly.
Below is a flavor of MongoDB in Java:
// connect and get the target collection (legacy MongoDB Java driver)
DB db = new MongoClient("localhost", 27017).getDB("mkyongDB");
DBCollection collection = db.getCollection("hosting");

BasicDBObject document = new BasicDBObject();
document.put("database", "mkyongDB");
document.put("table", "hosting");

BasicDBObject documentDetail = new BasicDBObject();
documentDetail.put("records", "99");
documentDetail.put("index", "vps_index1");
documentDetail.put("active", "true");

// nest the detail sub-document and insert the whole document
document.put("detail", documentDetail);
collection.insert(document);
This tutorial is good for getting started. You can download MongoDB from GitHub.
For MongoDB optimization, please refer to this post.
Do you absolutely have to use MySQL, or are you open to other DBs as well? MongoDB or CouchDB would be a good fit for this kind of need. Check them out if you are open to other DB options.
If you absolutely have to go with MySQL, then we have done something similar: all the related text messages go into a child record as a single JSON blob. We append to it every time, and we keep the master in a separate table. So there is at minimum one master and one child record, with more child records as the messages exceed a certain number (30 in our scenario); a kind of "load more.." feature then queries the second child record, which holds the next 30.
Hope this helps.
FYI, we are migrating to CouchDB for some other reasons and needs.
There are at least 2 different parts to this problem:
Processing the messages for storage in the database
What type of storage to use for the message
For processing the messages, you're likely going to need a horizontally scalable system (meaning you can add more machines to process the messages quickly) so you don't accumulate a huge backlog of messages. You should definitely not try to write these messages synchronously, but rather when a message is received, put it on a queue to be processed for writing to the database (something like JMS comes to mind here).
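As an illustration of the queueing step, a minimal JMS producer might look like the sketch below; ActiveMQ is only an assumed broker here, and the queue name and message body are placeholders. A separate pool of consumers would drain the queue and batch the writes into the datastore.

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;  // assumption: ActiveMQ as the broker

public class MessageEnqueuer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        connection.start();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Queue queue = session.createQueue("incoming-messages");
        MessageProducer producer = session.createProducer(queue);

        // the receiving tier only enqueues; consumers do the actual database writes
        producer.send(session.createTextMessage("{\"user\":42,\"text\":\"hello\"}"));

        session.close();
        connection.close();
    }
}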
In terms of data storage, MySQL is a relational database, but it doesn't sound like you are really doing any relational data processing, rather just storing a large amount of data. I would suggest looking into a NoSQL database (as others have suggested here as well) such as MongoDB, Cassandra, CouchDB, etc. They each have their strengths and weaknesses (you can read more about each of them on their respective websites and elsewhere on the internet).
I guess typical access would involve retrieving at least all the text of one chat session.
The number of rows is large and your data is not especially relational. This is a good fit for a non-relational database.
If you still want to go with MySQL, use partitions. When writing, use batch inserts, and when reading, provide sufficient partition-pruning hints in your queries. Use EXPLAIN PARTITIONS to check whether partitions are being pruned (a small sketch of that check follows below). In this case I would strongly recommend combining the chat lines of one chat session into a single row; this will dramatically reduce the number of rows compared to one chat line per row.
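A quick way to verify pruning from Java might look like the following sketch; the table, partitioning column, and date range are placeholders, and on MySQL 5.7+ a plain EXPLAIN already reports the partitions column.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PruningCheck {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/chat", "user", "pass");
             Statement st = conn.createStatement();
             // hypothetical table partitioned by day on created_at
             ResultSet rs = st.executeQuery(
                 "EXPLAIN PARTITIONS SELECT * FROM chat_sessions "
               + "WHERE created_at >= '2014-01-01' AND created_at < '2014-01-02'")) {
            while (rs.next()) {
                // only the matching day partition(s) should be listed here
                System.out.println("partitions used: " + rs.getString("partitions"));
            }
        }
    }
}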
You didn't mention how many days of data you want to store.
On a separate note: how successful would your app have to be, in terms of users, to require 200k messages per second? An active chat session might generate about 1 message every 5 seconds from one user; for ease of calculation, let's make it 1 per second. So you are building capacity for 200K concurrently active users, which implies at least a few million users overall.
It is good to think of scale early. However, it requires engineering effort. And since resources are limited, allocate them carefully for each task (Performance/UX etc). Spending more time on UX, for example, may yield a better ROI. When you get to multi-million user territory, new doors will open. You might be funded by an Angel or VC. Think of it as a good problem to have.
My 2 cents.
