Everything I have read about running multiple ES nodes concerns enabling index replication and scaling. I was wondering whether it could also help us index a large number of files faster. I have two questions:
Question 1: Would it be accurate to think that using multiple ES nodes would allow us to index multiple times faster?
Question 2: What effect does it have on indexing if I keep all nodes enabled as data nodes? On the other hand, what effect does it have if I run a few non-data nodes (e.g. one dedicated master and one dedicated client node) alongside a few data nodes? Which is better in terms of speed and scaling?
Answer 1: No.
The speed of indexing will in fact decrease if you enable replication (though it may improve search performance). You can look at this question for tips on improving indexing performance.
Answer 2: It depends (with no replicas, it is the same).
During indexing the data goes only to data nodes. The cluster state contains information about which nodes are data nodes, and requests are routed accordingly. The only performance impact is that a node receiving a request may have to reroute/forward it to the data nodes.
If you are adding machines without increasing the number of replicas, you will get better indexing performance. This is not surprising, since you are adding more resources while the amount of work to be done remains pretty much the same.
In our environment we run 20 nodes in production and 5-10 nodes in debug. Both environments hold the same volume of data. Since ES update speed (we use Groovy scripts to merge new documents into existing documents) is our primary bottleneck, we see much better performance in our production environment than in the other environments.
You already got some useful links in other answers to your question. I can add that in our case the three most significant factors in improving data upload were: reducing the refresh_interval, increasing the merge_factor, and using the Elasticsearch-Hadoop plugin (we upload the data from Spark), which handles all the major data-transfer optimisation at the application level.
Each of those steps has its own disadvantages, so read the manuals carefully before changing the configuration.
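For the first two factors, here is a sketch of the settings we change through the index-settings API. This assumes a pre-2.0 Elasticsearch, where merge_factor still existed as a log-merge-policy option, so verify the names against the version you run:

```json
{
  "index": {
    "refresh_interval": "30s",
    "merge.policy.merge_factor": 30
  }
}
```

PUT this body to `<index>/_settings` before the bulk upload, and restore refresh_interval to its default (`1s`) once the load finishes.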
Related
Description (for reference):
I want to index an entire drive of files: ~2TB.
I'm getting the list of files (using the Commons IO library).
Once I have the list of files, I go through each file and extract readable data from it using Apache Tika.
Once I have the data, I index it using Solr.
I'm using SolrJ with a Java application.
My question is: how do I decide what size of collection to pass to Solr? I've tried passing in different sizes with different results, i.e. sometimes 150 documents per collection performs better than 100, but sometimes not. Is there an optimal size or configuration you can tweak, as this process has to be repeated?
Complications :
1) The files are stored on a network drive, so retrieving the filenames/files takes time too.
2) Neither this program (the Java app) nor Solr itself can use more than 512MB of RAM.
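Since the best batch size depends on the document mix and the 512MB ceiling, one pragmatic approach is simply to time a few candidate sizes on a sample and keep the winner. A minimal Python sketch of that experiment; `index_batch` is a hypothetical stand-in for the real SolrJ add/commit call:

```python
import time

def chunks(items, size):
    """Yield successive fixed-size batches from a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def best_batch_size(docs, candidate_sizes, index_batch):
    """Time each candidate batch size over the same docs; return the fastest."""
    timings = {}
    for size in candidate_sizes:
        start = time.perf_counter()
        for batch in chunks(docs, size):
            index_batch(batch)  # stand-in for client.add(batch) + commit
        timings[size] = time.perf_counter() - start
    return min(timings, key=timings.get)
```

With a network drive in the loop, I/O jitter is significant, so a few repetitions per candidate (and a re-run whenever the document mix changes) are worth the effort.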
I'll name just a few of the many parameters that may affect indexing speed. Usually you need to experiment with your own hardware, RAM, data-processing complexity etc. to find the best combination; there is no single silver bullet for all.
Increase the number of segments during indexing to some large number, say 10k. This ensures that segment merging does not happen as often as it would with the default of 10 segments; merging segments during indexing contributes to slowing it down. Note that you will have to merge the segments after indexing completes for your search engine to perform well, and then lower the segment count back to something sensible, like 10.
Reduce the logging on your container during the indexing; this can be done via the Solr admin UI and makes indexing faster.
Either reduce the frequency of auto-commits or switch them off and control the committing yourself.
Remove the warmup queries for the bulk indexing, and don't auto-warm any cache entries.
Use ConcurrentUpdateSolrServer, or CloudSolrServer if you are using SolrCloud.
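Several of the knobs above live in solrconfig.xml. A sketch assuming a Solr 4.x-style configuration (element names have moved between releases, so check the reference guide for your version):

```xml
<indexConfig>
  <!-- let many segments accumulate before merging kicks in -->
  <mergeFactor>100</mergeFactor>
</indexConfig>

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- commit rarely during the bulk load, or drop this and commit manually -->
  <autoCommit>
    <maxTime>300000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>
```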
Comment out auto-commit and tlogs, and index on a single core. Use multithreading in your SolrJ client (number of threads = number of CPUs × 2) to hit the single core.
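The threading suggestion above can be sketched as follows; Python is used for brevity, and `send_batch` is a hypothetical stand-in for whatever wraps your SolrJ add call:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def index_in_parallel(batches, send_batch, workers=None):
    """Send document batches concurrently; workers defaults to 2x CPU count."""
    workers = workers or (os.cpu_count() or 1) * 2
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves batch order in the returned results
        return list(pool.map(send_batch, batches))
```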
I am using MySQL database for my webapp.
I need to search over multiple tables and multiple columns, very much like full-text searching inside those columns.
I'd like to know your experience of using a full-text search engine (e.g. Solr/Lucene/MapReduce/Hadoop etc.) rather than plain SQL, in terms of:
Speed performance
Extra space usage
Extra CPU usage (is it continuously building the index?)
How long it takes to build the index before it is ready for use
Please let me know your experience of using these frameworks.
Thanks a lot!
To answer your questions
1.) I have a database with roughly 5 million docs. MySQL full-text search needs 2-3 minutes; Solr/Lucene needs roughly 200-400 milliseconds for the same search.
2.) The space you need depends on your configuration, the number of copyFields, and whether you store the data or only index it. In my configuration the full DB is indexed but only metadata is stored, so a 30GB DB needs 40GB for Solr/Lucene. Keep in mind that if you want to (re)optimize your index, you temporarily need 100% of the index size again.
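For context on the copyField point: each copyField declared in schema.xml indexes the content a second time under another field, so every one you add grows the index. A minimal illustration with made-up field names:

```xml
<!-- schema.xml: all_text receives a second indexed copy of title -->
<field name="title" type="text_general" indexed="true" stored="false"/>
<field name="all_text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="title" dest="all_text"/>
```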
3.) If you migrate from a MySQL full-text index to Lucene/Solr, you save CPU power: MySQL full-text search needs much more CPU than Solr full-text search (see answer 1).
4.) It depends on the number of documents, the size of the documents and the disk speed. CPU performance is of course very important too, but there is no good scaling over multiple CPUs during index time, so 2 big cores are much faster than 8 small ones.
Indexing 5 million docs (44GB) in my environment takes 2-3 hours on a dual-core VMware server.
5.) Migrating from a MySQL full-text index to a Lucene/Solr full-text index was the best idea ever. ;-) But you will probably have to redesign your application.
// Edit, to answer the question "Will the Lucene index get updated immediately after some insert statements?": it depends on your Solr configuration, but it is possible.
Q1: Lucene is usually faster and more powerful in terms of features (if correctly implemented)
Q2: if you don't store the original content, it's usually 20-30% of the original (indexed) content
Q4: Depends on the size of the content you want to index, the amount of processing you'll be doing (you can have your own analyzers, etc.), and your hardware... you'll have to do a benchmark. For one of my projects, it took 15 minutes to build a 500MB index (out-of-the-box performance, no tweaks attempted); for another, it took 3 days to build a huge 17GB index.
I am planning to use MongoDB on EC2 for my web application. Right now I am getting 8000 r/w operations on MongoDB; my instance type is m1.large.
For optimum performance I have followed these sites:
site-1 site-2
I have tried a lot but failed to achieve the performance mentioned on the sites above. Is there any other resource where I can find optimum EC2 performance benchmarks and some sort of configuration?
Here are some things you can do to increase write performance on EC2:
Raid setup
Raid 0 will get you better speed, but make sure your application doesn't require the mirroring provided by raid 10 (the recommended setup).
Raid 10 is mirroring (raid 1) and striping (raid 0) together.
Note: there's a 2 Gbps rate limit between an individual EC2 instance and the EBS service as a whole, so this is the maximum rate you could ever hope to write to EBS from a single node, no matter how many volumes you have.
Use sharding
The more the writes are spread around to different boxes, the less write load each one has.
Index only the important stuff
Try to limit/reduce the number of indexes you have. With each additional index, an insert or update incurs additional writes (and therefore worse write performance).
Control document size
Try to control document size so that documents don't grow and therefore have to be moved to the end of the collection too often. If you have an array you push to, try to fix (limit) the number of items and their individual size so that Mongo can be smart about the padding factor and help you out here.
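As a concrete way to cap an array you keep pushing to, newer MongoDB versions (2.4+) support $slice together with $each in a $push; a mongo-shell sketch with made-up collection and field names:

```
db.profiles.update(
  { _id: userId },
  { $push: { events: { $each: [newEvent], $slice: -100 } } }  // keep the last 100
)
```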
I am looking for a simple way to store and retrieve millions of xml files. Currently everything is done in a filesystem, which has some performance issues.
Our requirements are:
Ability to store millions of XML files in a batch process. XML files may be up to a few megabytes large; most are in the 100KB range.
Very fast random lookup by id (e.g. document URL)
Accessible by both Java and Perl
Available on the most important Linux-Distros and Windows
I did have a look at several NoSQL platforms (e.g. CouchDB, Riak and others), and while those systems look great, they seem almost like overkill for our needs:
No clustering required
No daemon ("service") required
No clever search functionality required
Having delved deeper into Riak, I found Bitcask (see the intro), which seems like exactly what I want. The basics described in the intro are really intriguing. But unfortunately there seems to be no way to access a Bitcask repo from Java (or is there?).
So my question boils down to:
Is the following assumption right: the Bitcask model (append-only writes, in-memory key management) is the right way to store/retrieve millions of documents?
Are there any viable alternatives to Bitcask accessible from Java? (BerkeleyDB comes to mind...)
(For Riak specialists) Is Riak much overhead, implementation-, management- and resource-wise, compared to "naked" Bitcask?
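To make assumption 1 concrete: the core Bitcask idea is an append-only data file plus an in-memory key-to-offset map, giving one seek per read. A toy Python sketch of just that idea (no record headers, no crash recovery, not production code):

```python
import os

class TinyBitcask:
    """Toy append-only store: values in one data file, keys in a dict."""

    def __init__(self, path):
        self.f = open(path, "a+b")
        self.index = {}  # key -> (offset, length)

    def put(self, key, value):
        self.f.seek(0, os.SEEK_END)  # append-only: new data always goes last
        offset = self.f.tell()
        self.f.write(value)
        self.f.flush()
        self.index[key] = (offset, len(value))  # old copy becomes garbage

    def get(self, key):
        offset, length = self.index[key]  # in-memory lookup...
        self.f.seek(offset)               # ...then a single disk seek
        return self.f.read(length)
```

A real Bitcask additionally prefixes each record with a CRC, timestamp and the key itself so the index can be rebuilt on restart, and periodically merges data files to reclaim the garbage left by overwrites.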
I don't think that Bitcask is going to work well for your use-case. It looks like the Bitcask model is designed for use-cases where the size of each value is relatively small.
The problem is Bitcask's data-file merging process. This involves copying all of the live values from a number of older data files into the merged data file. If you've got millions of values in the region of 100KB each, this is an insane amount of data copying.
Note the above assumes that the XML documents are updated relatively frequently. If updates are rare and / or you can cope with a significant amount of space "waste", then merging may only need to be done rarely, or not at all.
Bitcask can be appropriate for this case (large values) depending on whether or not there is a great deal of overwriting. In particular, there is no reason to merge files unless there is a great deal of wasted space, which only occurs when new values arrive with the same key as old values.
Bitcask is particularly good for this batch load case as it will sequentially write the incoming data stream straight to disk. Lookups will take one seek in most cases, although the file cache will help you if there is any temporal locality.
I am not sure on the status of a Java version/wrapper.
I need to search over a petabyte of data in CSV-format files. After indexing with Lucene, the index is double the size of the original data. Is it possible to reduce the index size? How do I distribute Lucene index files in Hadoop, and how do I use them in a search environment? Or is it necessary to use Solr to distribute the Lucene index? My requirement is instant search over petabytes of files.
Hadoop and Map Reduce are based on batch processing models. You're not going to get instant response speed out of them, that's just not what the tool is designed to do. You might be able to speed up your indexing speed with Hadoop, but it isn't going to do what you want for querying.
Take a look at Lucandra, which is a Cassandra based back end for Lucene. Cassandra is another distributed data store, developed at Facebook if I recall, designed for faster access time in a more query oriented access model than hadoop.
Any decent off the shelf search engine (like Lucene) should be able to provide search functionality over the size of data you have. You may have to do a bit of work up front to design the indexes and configure how the search works, but this is just config.
You won't get instant results but you might be able to get very quick results. The speed will probably depend on how you set it up and what kind of hardware you run on.
You mention that the indexes are larger than the original data. This is to be expected. Indexing usually includes some form of denormalisation. The size of the indexes is often a trade off with speed; the more ways you slice and dice the data in advance, the quicker it is to find references.
Lastly you mention distributing the indexes, this is almost certainly not something you want to do. The practicalities of distributing many petabytes of data are pretty daunting. What you probably want is to have the indexes sat on a big fat computer somewhere and provide search services on the data (bring the query to the data, don't take the data to the query).
If you want to avoid changing your implementation, you can decompose your Lucene index into 10, 20 or even more indices and query them in parallel. It worked in my case: I created 8 indices for 80 GB of data, and I needed to implement search that works on a developer machine (Intel Core Duo, 3GB RAM).
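The split-and-query-in-parallel approach can be sketched like this; `search_shard` is a hypothetical stand-in for querying one of the smaller Lucene/Solr indices, returning (score, doc_id) pairs:

```python
from concurrent.futures import ThreadPoolExecutor

def search_all_shards(shards, query, search_shard, top_k=10):
    """Query every shard in parallel and merge the scored hits."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        per_shard = pool.map(lambda s: search_shard(s, query), shards)
        hits = [hit for shard_hits in per_shard for hit in shard_hits]
    # global merge: highest score first, truncated to the top_k results
    return sorted(hits, key=lambda h: h[0], reverse=True)[:top_k]
```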