MongoDB EC2 BenchMark Configuration - java

I am planning to use MongoDB on EC2 for my web application. Right now I am getting 8000 r/w on MongoDB. My MongoDB instance type is m1.large.
For optimum performance I have followed these sites:
site-1 site-2
I have tried a lot but failed to achieve the performance mentioned on those sites. Is there any other resource where I can find optimum EC2 performance benchmarks and a recommended configuration?

Here are some things you can do to increase write performance on EC2:
RAID setup
RAID 0 will get you better speed, but make sure your application doesn't require the mirroring provided by RAID 10 (the recommended setup).
RAID 10 is mirroring (RAID 1) and striping (RAID 0) together.
Note: There's a 2gbs rate limit between an individual EC2 instance and the
EBS service as a whole, so this is the maximum rate you could ever
possibly hope to write to EBS from a single node no matter how many
volumes you have.
Use sharding
The more the writes are spread across different boxes, the less write load each one has.
Index only the important stuff
Try to limit or reduce the number of indexes you have. With each additional index, an insert or update also incurs additional writes (and therefore worse write performance).
Control Document size
Try to control document size so that documents don't grow and therefore have to be moved to the end of the collection very often. If you have an array you are pushing to, try to fix (limit) the number of items and their individual size so that mongo can be smart about the padding factor and help you out here.
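The advice about limiting the number and size of array items can be sketched on the application side. The class below is a hypothetical helper (BoundedEvents and its method names are not part of any MongoDB driver); it only shows the idea of keeping the newest N entries, each truncated to a fixed length, before the array is written back. MongoDB's $push update operator also accepts a $slice modifier that caps an array on the server, which may be a better fit if it is available in your version.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical helper: keeps only the newest maxItems entries, each cut to maxChars. */
final class BoundedEvents {
    private final int maxItems;
    private final int maxChars;
    private final Deque<String> items = new ArrayDeque<>();

    BoundedEvents(int maxItems, int maxChars) {
        if (maxItems < 1 || maxChars < 1) {
            throw new IllegalArgumentException("limits must be >= 1");
        }
        this.maxItems = maxItems;
        this.maxChars = maxChars;
    }

    /** Appends a value, truncating it and evicting the oldest entry if needed. */
    void push(String value) {
        String trimmed = value.length() > maxChars ? value.substring(0, maxChars) : value;
        items.addLast(trimmed);
        if (items.size() > maxItems) {
            items.removeFirst();
        }
    }

    /** Returns a copy of the current entries, oldest first. */
    List<String> snapshot() {
        return new ArrayList<>(items);
    }
}
```

With both limits fixed, the array has a known upper bound on size, so the document stops growing once the array is full.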

Related

Fast indexing using multiple ES nodes?

Everything I have read about running multiple ES nodes is about enabling index replication and scaling. I was wondering whether it could also help us index a large number of files faster. I have two questions:
Question 1: Would it be accurate to think that using multiple ES nodes would allow us to index multiple times faster?
Question 2: What effect does it have on indexing if I keep all nodes enabled as data nodes? On the other hand, what effect does it have on indexing if I have a few non-data nodes (e.g. one dedicated master and one dedicated client node) alongside a few data nodes? Which will be better in terms of speed and scaling?
Answer1: No.
The speed of indexing will in fact decrease if you enable replication (though it may increase search performance). You can look at this question for improving indexing performance.
Answer2: It depends (if no replica then same).
During indexing the data will go only to data nodes. Your cluster state contains information about which nodes are data nodes and routes the request accordingly. The only performance impact comes from the node receiving the request having to reroute/forward it to the data nodes.
If you are adding machines without increasing the number of replicas you will get a better performance during indexing. It is not surprising since you are adding more resources while the amount of work to be done remains pretty much the same.
In our environment we are using 20 nodes in production and 5-10 nodes in debug. Both environments hold the same volume of data. Since ES update speed (we are using Groovy scripts to merge new documents into existing documents) is our primary bottleneck, we see much better performance in our production environment than in the other environments.
You already got some useful links in the other answers to your question. I can add that in our case the 3 most significant factors in data upload improvements were: reducing the refresh frequency (that is, raising refresh_interval), increasing the merge_factor, and using the Elasticsearch-Hadoop plugin (we upload the data from Spark), which handles all the major data transfer optimisation on the application level.
Each of those steps has its own disadvantages, so read the manuals carefully before changing the configuration.

How to get the logical data usage within a Cassandra cluster

We could look at the physical bytes on disk, but that number includes all the replicas. So I am wondering: is there a good approach to get the logical data usage (the real, meaningful data size) within the cluster without iterating over all of them? Thanks.
Unfortunately no. Even nodetool cfstats, which shows live SSTable size, includes the replicas. You can iterate, add them all up, and divide by the number of replicas you have to get a rough estimate.
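The rough estimate described above is plain arithmetic, sketched below. It assumes you have already collected the live SSTable size of every node (for example from nodetool cfstats) and that a single replication factor applies to all of the data. The class and method names are made up for illustration, and the result ignores compression and not-yet-compacted data.

```java
import java.util.Arrays;

/** Illustrative only: logical size is roughly total on-disk size / replication factor. */
final class LogicalSizeEstimate {

    /**
     * @param perNodeLiveBytes  live SSTable bytes reported by each node
     * @param replicationFactor number of copies kept of each row (assumed uniform)
     * @return estimated size of one copy of the data, in bytes
     */
    static long estimateLogicalBytes(long[] perNodeLiveBytes, int replicationFactor) {
        if (replicationFactor < 1) {
            throw new IllegalArgumentException("replication factor must be >= 1");
        }
        long total = Arrays.stream(perNodeLiveBytes).sum();
        return total / replicationFactor;
    }

    public static void main(String[] args) {
        // Made-up sample values for a three-node cluster with replication factor 3.
        long[] perNode = {120L << 30, 115L << 30, 125L << 30};
        System.out.println(estimateLogicalBytes(perNode, 3) + " bytes");
    }
}
```

If keyspaces use different replication factors, the same division would have to be done per keyspace and the results added up.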

Find shortest indexing time in solr

Description (for reference):
I want to index an entire drive of files : ~2TB
I'm getting the list of files (Using commons io library).
Once I have the list of files, I go through each file and extract readable data from that using Apache Tika
Once I have the data I'm indexing it using solr.
I'm using solrj with the java application
My question is: How do I decide what size of collection to pass to Solr? I've tried passing in different sizes with different results, i.e. sometimes 150 documents per collection performs better than 100 documents, but sometimes it does not. Is there an optimal way or configuration that you can tweak, as this process has to be carried out repeatedly?
Complications :
1) Files are stored on a network drive, retrieving the filenames/files takes some time too.
2) Both this program (java app) and solr itself cannot use more than 512MB of ram
I'll name just a few of the many parameters that may affect the indexing speed. Usually you need to experiment with your own hardware, RAM, data processing complexity etc. to find the best combination, i.e. there is no single silver bullet for all cases.
Increase the number of segments during indexing to some large number, say 10k. This makes sure that segment merging does not happen as often as it would with the default of 10 segments. Merging segments during indexing slows the indexing down. You will have to merge the segments after the indexing is complete for your search engine to perform well. Also lower the number of segments back to something sensible, like 10.
Reduce the logging on your container during the indexing. This can be done using the solr admin UI. This makes the process of indexing faster.
Either reduce the frequency of auto-commits or switch them off and control the committing yourself.
Remove the warmup queries for the bulk indexing, don't auto-copy any cache entries.
Use ConcurrentUpdateSolrServer and if using SolrCloud, then CloudSolrServer.
Comment out auto commit and tlogs and index on a single core. Use multithreading in your SolrJ code (number of threads = number of CPUs * 2) to hit a single core.
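Since the answers above agree that the best batch size has to be found by experiment, a small harness helps to compare settings. The sketch below uses only the JDK: it splits the documents into batches, pushes them through a fixed thread pool, and returns the elapsed time. BatchedIndexer and the sink parameter are hypothetical names; in a real run the sink would wrap whatever SolrJ call sends one batch to Solr, which is deliberately left out here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

/** Illustrative harness for timing different batch sizes and thread counts. */
final class BatchedIndexer {

    /** Splits docs into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> docs, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += batchSize) {
            int end = Math.min(i + batchSize, docs.size());
            batches.add(new ArrayList<>(docs.subList(i, end)));
        }
        return batches;
    }

    /** Sends every batch to the sink on a fixed pool and returns elapsed milliseconds. */
    static <T> long indexAll(List<T> docs, int batchSize, int threads, Consumer<List<T>> sink) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (List<T> batch : partition(docs, batchSize)) {
                pending.add(pool.submit(() -> sink.accept(batch)));
            }
            for (Future<?> f : pending) {
                try {
                    f.get(); // wait for the batch and surface any failure
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while indexing", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("batch failed", e.getCause());
                }
            }
        } finally {
            pool.shutdown();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

Following the suggestion above, the thread count could start at Runtime.getRuntime().availableProcessors() * 2. Because every batch is submitted up front, a producer working under a 512MB heap would want to feed documents in chunks rather than hold the whole list in memory.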

Replacing a huge dump file with an efficient lookup Java key-value text store

I have a huge dump file - 12GB of text containing millions of entries. Each entry has a numeric id, some text, and other irrelevant properties. I want to convert this file into something that will provide an efficient look-up. That is, given an id, it would return the text quickly. The limitations:
Embedded in Java, preferably without an external server or foreign language dependencies.
Reads and writes to the disk, not in-memory - I don't have 12GB of RAM.
Does not blow up too much - I don't want to turn a 12GB file into a 200GB index. I don't need full text search, sorting, or anything fancy - Just key-value lookup.
Efficient - It's a lot of data and I have just one machine, so speed is an issue. Tools that can store large batches and/or work well with several threads are preferred.
Storing more than one field is nice, but not a must. The main concern is the text.
Your recommendations are welcomed!
I would use Java Chronicle or something like it (partly because I wrote it) because it is designed to access large amounts of data (larger than your machine's memory) somewhat randomly.
It can store any number of fields in text or binary formats (or a combination if you wish). It adds 8 bytes per record you want to be able to randomly access. It doesn't support deleting records (you can mark them for reuse), but you can update and add new records.
It can only have a single writer thread, but it can be read by any number of threads on the same machine (even from different processes).
It doesn't support batching, but it can read/write millions of entries per second with typical sub-microsecond latency (except for random reads/writes which are not in memory).
It uses next to no heap (<1 MB for TBs of data).
It uses its own sequential id, but you can build a table to translate your ids to it.
BTW: You can buy 32 GB for less than $200. Perhaps it's time to get more memory ;)
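The id translation table mentioned above can be illustrated without any library. The sketch below is not Chronicle's API; it is a plain-JDK stand-in that keeps the dump on disk and holds only an id-to-byte-offset map in memory. It assumes, purely for illustration, that the dump has one entry per line in the form id, TAB, text, encoded as UTF-8; lines without a tab are skipped. OffsetIndex is a made-up name.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Illustrative on-disk lookup: the text stays in the file, only offsets live in memory. */
final class OffsetIndex implements AutoCloseable {
    private final RandomAccessFile file;
    private final Map<Long, Long> offsets = new HashMap<>();

    /** Scans the file once and records the byte offset at which each id's line starts. */
    OffsetIndex(String path) throws IOException {
        file = new RandomAccessFile(path, "r");
        long pos = file.getFilePointer();
        String line;
        while ((line = file.readLine()) != null) {
            int tab = line.indexOf('\t');
            if (tab > 0) {
                offsets.put(Long.parseLong(line.substring(0, tab)), pos);
            }
            pos = file.getFilePointer();
        }
    }

    /** Returns the text stored for id, or null if the id is unknown. */
    String lookup(long id) throws IOException {
        Long pos = offsets.get(id);
        if (pos == null) {
            return null;
        }
        file.seek(pos);
        String raw = file.readLine();
        String text = raw.substring(raw.indexOf('\t') + 1);
        // readLine() maps each byte to one char, so re-decode the bytes as UTF-8.
        return new String(text.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

RandomAccessFile.readLine() is unbuffered, so the initial scan of a 12GB file would be slow; a buffered scan that counts bytes, or saving the offsets to a side file, would be the obvious next step. A HashMap of boxed longs also costs far more than 16 bytes per entry, so sorted primitive arrays with a binary search may be preferable for many millions of ids.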
Why not use Java DB - the database that comes with Java?
It'll store the info on disk and be efficient in terms of lookups, provided you index properly. It'll run in-JVM, so you don't need a separate server/service. You talk to it using standard JDBC.
I suspect it'll be pretty efficient. This database has a long history (it is Apache Derby, which was previously IBM's Cloudscape) and will have had a lot of effort expended on it in terms of robustness and efficiency.
You'll obviously need to do an initial onboarding of the data to create the database, but that's a one-off task.

how much extra Space/RAM/CPU is used by apache solr?

I am using a MySQL database for my webapp.
I need to search over multiple tables and multiple columns; it is very similar to full-text searching inside those columns.
I would like to know your experience of using any full-text search API (e.g. Solr/Lucene/MapReduce/Hadoop etc.) compared to using simple SQL, in terms of:
Speed performance
Extra space usage
Extra CPU usage (is it continuously building the index?)
How long does it take to build the index, or until it is ready for use?
Please let me know your experience of using these frameworks.
Thanks a lot!
To answer your questions:
1.) I have a database with roughly 5 million docs. A MySQL full-text search needs 2-3 minutes. Solr/Lucene needs roughly 200-400 milliseconds for the same search.
2.) The space you need depends on your configuration, the number of copyFields, and whether you store the data or only index it. In my configuration, the full DB is indexed, but only metadata is stored. So a 30 GB DB needs 40 GB for Solr/Lucene. Keep in mind that if you want to (re)optimize your index, you temporarily need another 100% of the index size.
3.) If you migrate from a MySQL full-text index to Lucene/Solr, you save CPU power. MySQL full-text search needs much more CPU power than Solr full-text search -> see answer 1.)
4.) It depends on the number of documents, the size of the documents, and the disk speed. Of course the CPU performance is very important. Indexing does not scale well over multiple CPUs: 2 big cores are much faster than 8 small cores.
Indexing 5 million docs (44 GB) in my environment needs 2-3 hours on a dual-core VMware server.
5.) Migrating from the MySQL full-text index to a Lucene/Solr full-text index was the best idea ever. ;-) But you will probably have to redesign your application.
//Edit to answer the question "Will the Lucene index get updated immediately after some INSERT statements?"
It depends on your Solr configuration, but it is possible.
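The figures in this answer can be turned into a quick sizing calculation. The sketch below only restates the arithmetic given above (an optimize temporarily needs another 100% of the index size, and throughput is documents divided by elapsed time); SolrSizing and its method names are made up for illustration.

```java
/** Illustrative sizing helpers based on the numbers quoted in the answer above. */
final class SolrSizing {

    /** Peak disk need during an optimize: the index plus a temporary copy of equal size. */
    static long peakDiskGb(long indexGb) {
        return indexGb * 2;
    }

    /** Average indexing throughput in documents per second. */
    static double docsPerSecond(long docs, double hours) {
        if (hours <= 0) {
            throw new IllegalArgumentException("hours must be > 0");
        }
        return docs / (hours * 3600.0);
    }

    public static void main(String[] args) {
        System.out.println("Peak disk for a 40 GB index: " + peakDiskGb(40) + " GB");
        System.out.println("Docs/s at 2 h: " + docsPerSecond(5_000_000L, 2.0));
        System.out.println("Docs/s at 3 h: " + docsPerSecond(5_000_000L, 3.0));
    }
}
```

For the 40 GB index this means planning for about 80 GB of disk, and 5 million docs in 2-3 hours works out to roughly 460 to 690 documents per second on that dual-core machine.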
Q1: Lucene is usually faster and more powerful in terms of features (if correctly implemented)
Q2: if you don't store the original content, it's usually 20-30% of the original (indexed) content
Q4: Depends on the size of your content that you want to index, on the amount of processing you'll be doing (you can have your own analyzers, etc), then your hardware... you'll have to do a benchmark. For one of my projects, last time it took 15min to build a 500MB index (out of the box performance, no tweaks attempted), for another, it took 3 days to build a huge 17GB index.
