How much extra space/RAM/CPU is used by Apache Solr? - java

I am using MySQL database for my webapp.
I need to search over multiple tables and multiple columns; it is very similar to full-text searching inside those columns.
I need to know your experience of using any full-text search API (e.g. Solr/Lucene/MapReduce/Hadoop etc.) compared with plain SQL, in terms of:
Speed performance
Extra space usage
Extra CPU usage (does it continuously build the index?)
How long it takes to build the index before it is ready for use
Please let me know your experience of using these frameworks.
Thanks a lot!

To answer your questions:
1.) I have a database with roughly 5 million docs. A MySQL full-text search takes 2-3 minutes; Solr/Lucene needs roughly 200-400 milliseconds for the same search.
2.) The space you need depends on your configuration, the number of copyFields, and whether you store the data or only index it. In my configuration the full DB is indexed, but only the metadata is stored, so a 30 GB DB needs about 40 GB for Solr/Lucene. Keep in mind that if you want to (re)optimize your index, you temporarily need 100% of the index size again.
3.) If you migrate from a MySQL full-text index to Lucene/Solr, you save CPU power. MySQL full-text search needs much more CPU than Solr full-text search -> see answer 1.)
4.) That depends on the number of documents, the size of the documents, and the disk speed. Of course CPU performance is very important; indexing does not scale well over multiple CPUs, so 2 big cores are much faster than 8 small ones.
Indexing 5 million docs (44 GB) in my environment takes 2-3 hours on a dual-core VMware server.
5.) Migrating from a MySQL full-text index to a Lucene/Solr full-text index was the best idea ever. ;-) But you will probably have to redesign your application.
// Edit, to answer the question "Will the Lucene index get updated immediately after some insert statements?":
It depends on your Solr configuration, but it is possible.
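For example, if your application pushes new rows to Solr itself, a minimal SolrJ sketch (4.x-style API; the core URL and field names are placeholder assumptions) can use commitWithin so the document becomes searchable within a bounded time, provided soft commits are enabled on the server side:

    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    import java.io.IOException;

    public class NearRealTimeUpdate {
        public static void main(String[] args) throws IOException, SolrServerException {
            // Assumes a Solr core reachable at this URL -- adjust for your setup.
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/mycore");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "row-42");           // hypothetical field names
            doc.addField("text", "some row text");

            // Ask Solr to make this document searchable within ~1 second.
            solr.add(doc, 1000);

            solr.shutdown();
        }
    }

Whether MySQL inserts show up "immediately" then depends on how and when your application or import job pushes them to Solr.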

Q1: Lucene is usually faster and more powerful in terms of features (if correctly implemented).
Q2: If you don't store the original content, the index is usually 20-30% of the size of the original (indexed) content.
Q4: It depends on the size of the content you want to index, on the amount of processing you'll be doing (you can have your own analyzers, etc.), and then on your hardware... you'll have to do a benchmark. For one of my projects it took 15 minutes to build a 500 MB index (out-of-the-box performance, no tweaks attempted); for another, it took 3 days to build a huge 17 GB index.

Related

Find shortest indexing time in solr

Description (for reference):
I want to index an entire drive of files: ~2 TB.
I'm getting the list of files (using the Commons IO library).
Once I have the list of files, I go through each file and extract readable data from it using Apache Tika.
Once I have the data, I index it using Solr.
I'm using SolrJ with the Java application.
My question is: how do I decide what size of collection to pass to Solr? I've tried passing different sizes with different results, i.e. sometimes 150 documents per collection performs better than 100 documents, but sometimes it does not. Is there an optimal size or configuration that can be tweaked, since this process has to be carried out repeatedly? A sketch of the loop is included below for reference.
Complications:
1) Files are stored on a network drive, so retrieving the filenames/files takes some time too.
2) Neither this program (the Java app) nor Solr itself can use more than 512 MB of RAM.
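For reference, the loop in question boils down to something like this minimal sketch (hypothetical paths, core name, and field names; SolrJ 4.x-style API), with the batch size pulled out as the parameter I'm trying to tune:

    import org.apache.commons.io.FileUtils;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.Tika;

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    public class DriveIndexer {
        private static final int BATCH_SIZE = 150; // the value I'm trying to tune

        public static void main(String[] args) throws Exception {
            SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/files"); // hypothetical core
            Tika tika = new Tika();
            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();

            // Walk the network drive recursively with Commons IO.
            for (File file : FileUtils.listFiles(new File("/mnt/drive"), null, true)) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", file.getAbsolutePath());
                doc.addField("content", tika.parseToString(file)); // Tika text extraction

                batch.add(doc);
                if (batch.size() >= BATCH_SIZE) {
                    solr.add(batch);   // one batch per request
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                solr.add(batch);
            }
            solr.commit();             // single explicit commit at the end
            solr.shutdown();
        }
    }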
I'll name just a few of the many parameters that may affect indexing speed. Usually one needs to experiment with one's own hardware, RAM, data-processing complexity, etc. to find the best combination; there is no single silver bullet for all cases.
Increase the allowed number of segments during indexing to some large number, say 10k. This makes sure that merging of segments does not happen as often as it would with the default of 10 segments; merging segments during indexing slows the indexing down. You will have to merge the segments after indexing is complete for your search engine to perform well, and then lower the number of segments back to something sensible, like 10.
Reduce the logging on your container during indexing. This can be done via the Solr admin UI and makes the indexing process faster.
Either reduce the frequency of auto-commits or switch them off and control committing yourself.
Remove the warm-up queries for bulk indexing and don't auto-warm any cache entries.
Use ConcurrentUpdateSolrServer, or CloudSolrServer if you are using SolrCloud (see the sketch below).
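A minimal sketch of bulk loading through ConcurrentUpdateSolrServer (SolrJ 4.x-style API; the URL, queue size, thread count, and field names are placeholder assumptions to tune, not recommendations):

    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BulkLoader {
        public static void main(String[] args) throws Exception {
            // Internal queue of 10,000 documents, drained by 4 background threads.
            ConcurrentUpdateSolrServer solr =
                    new ConcurrentUpdateSolrServer("http://localhost:8983/solr/mycore", 10000, 4);

            for (int i = 0; i < 1000000; i++) {        // stand-in for your real document source
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);
                doc.addField("text", "body of document " + i);
                solr.add(doc);      // buffered and sent by the background threads
            }

            solr.blockUntilFinished();   // drain the internal queue
            solr.commit();               // one explicit commit at the end of the bulk load
            solr.shutdown();
        }
    }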
Comment out auto-commit and tlogs, and index on a single core. Use multi-threading in your SolrJ client (number of threads = number of CPUs * 2) to hit that single core.
Regards,
Rajat

Replacing a huge dump file with an efficient lookup Java key-value text store

I have a huge dump file - 12 GB of text containing millions of entries. Each entry has a numeric id, some text, and other irrelevant properties. I want to convert this file into something that provides an efficient look-up: given an id, it should return the text quickly. The limitations:
Embedded in Java, preferably without an external server or foreign-language dependencies.
Reads and writes to the disk, not in-memory - I don't have 12 GB of RAM.
Does not blow up too much - I don't want to turn a 12 GB file into a 200 GB index. I don't need full-text search, sorting, or anything fancy - just key-value lookup.
Efficient - it's a lot of data and I have just one machine, so speed is an issue. Tools that can store large batches and/or work well with several threads are preferred.
Storing more than one field is nice, but not a must. The main concern is the text.
Your recommendations are welcomed!
I would use Java Chronicle or something like it (partly because I wrote it) because it is designed to access large amounts of data (larger than your machine's memory) somewhat randomly.
It can store any number of fields in text or binary formats (or a combination if you wish). It adds 8 bytes per record that you want to be able to access randomly. It doesn't support deleting records (you can mark them for reuse), but you can update and add new records.
It can only have a single writer thread, but it can be read by any number of threads on the same machine (even in different processes).
It doesn't support batching, but it can read/write millions of entries per second with typical sub-microsecond latency (except for random reads/writes that are not in memory).
It uses next to no heap (<1 MB for TBs of data).
It uses a sequential id, but you can build a table to translate your ids to that sequence.
BTW: You can buy 32 GB of RAM for less than $200. Perhaps it's time to get more memory. ;)
Why not use Java DB, the database that comes with the JDK?
It'll store the info on disk and be efficient in terms of lookups, provided you index properly. It runs in-JVM, so you don't need a separate server/service, and you talk to it using standard JDBC.
I suspect it'll be pretty efficient. This database has a long history (it is Apache Derby, which originated at IBM), and a lot of effort has been spent on its robustness and efficiency.
You'll obviously need to do an initial load of the data to create the database, but that's a one-off task.
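A minimal sketch of what that looks like with embedded Derby (derby.jar on the classpath; the table and column names are just placeholders): create a table keyed on the numeric id, load it once, then look entries up through the primary-key index.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class DumpLookup {
        public static void main(String[] args) throws Exception {
            // Embedded Derby: the database lives in the local "dumpdb" directory, no server needed.
            Connection con = DriverManager.getConnection("jdbc:derby:dumpdb;create=true");

            try (Statement st = con.createStatement()) {
                st.executeUpdate("CREATE TABLE entries (id BIGINT PRIMARY KEY, txt CLOB)");
            }

            // One-off load: insert entries parsed from the 12 GB dump (parsing omitted here).
            try (PreparedStatement ins = con.prepareStatement("INSERT INTO entries VALUES (?, ?)")) {
                ins.setLong(1, 42L);
                ins.setString(2, "text of entry 42");
                ins.executeUpdate();
            }

            // Lookup by id: the primary-key index makes this a fast point query.
            try (PreparedStatement q = con.prepareStatement("SELECT txt FROM entries WHERE id = ?")) {
                q.setLong(1, 42L);
                try (ResultSet rs = q.executeQuery()) {
                    if (rs.next()) {
                        System.out.println(rs.getString(1));
                    }
                }
            }
            con.close();
        }
    }

For the initial 12 GB load you would disable auto-commit and use addBatch()/executeBatch() in chunks rather than one executeUpdate() per row.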

Mergesort or Database?

I have a rather complex database query which gives me 30 million records - roughly 15 times the amount of data that would fit into memory. I need to access all records from the database sequentially (i.e. sorted). For performance reasons it is not possible to use an "order by" clause, as preparing the ordered ResultSet takes roughly 40 minutes.
I see two possible options to solve my problem:
Dump the resulting data into an unordered file and use some form of merge sort to arrive at a sorted file.
Flatten the data, dump it into a secondary database, and reselect it using the ordering mechanisms of the database.
Which would you prefer for reasons of elegance and performance?
If your choice is number two, do you have a suggestion for the database to use? Would you prefer SQLite, MySQL or Apache Derby?
For sorting large amounts of data, one solution is to split them into blocks you can load, e.g. one 30th of the data at a time (15 * 2), and sort those records. This will give you 30 sorted files.
Take the 30 sorted files and do a merge sort between them (this requires at least 30 records in memory). You can process the records as you merge them; a minimal sketch follows below.
BTW: It's also possible it's time to buy a more powerful computer. You can buy a PC with 16 GB of memory and an SSD for close to $1000. For $2000 you can get a fast PC with 32 GB of memory. This could save you a lot of time. ;)
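A minimal sketch of that chunk-and-merge approach, assuming each record can be dumped as one line and compared by its natural line order (the file names and chunk size are placeholders; in practice you would compare on your sort key and size the chunks to your memory):

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.PriorityQueue;

    public class ExternalSort {

        /** Phase 1: read the unordered dump in chunks that fit in memory, sort each, write it out. */
        static List<File> splitAndSort(File input, int chunkSize) throws IOException {
            List<File> chunks = new ArrayList<File>();
            try (BufferedReader in = new BufferedReader(new FileReader(input))) {
                List<String> buffer = new ArrayList<String>(chunkSize);
                String line;
                while ((line = in.readLine()) != null) {
                    buffer.add(line);
                    if (buffer.size() == chunkSize) {
                        chunks.add(writeSortedChunk(buffer));
                        buffer.clear();
                    }
                }
                if (!buffer.isEmpty()) {
                    chunks.add(writeSortedChunk(buffer));
                }
            }
            return chunks;
        }

        static File writeSortedChunk(List<String> lines) throws IOException {
            Collections.sort(lines);                       // natural (lexicographic) order of the records
            File chunk = File.createTempFile("chunk", ".txt");
            try (BufferedWriter out = new BufferedWriter(new FileWriter(chunk))) {
                for (String l : lines) {
                    out.write(l);
                    out.newLine();
                }
            }
            return chunk;
        }

        /** Phase 2: k-way merge; only one current record per chunk is held in memory. */
        static void merge(List<File> chunks, File output) throws IOException {
            PriorityQueue<Source> heap = new PriorityQueue<Source>();
            for (File chunk : chunks) {
                Source s = new Source(new BufferedReader(new FileReader(chunk)));
                if (s.advance()) heap.add(s);
            }
            try (BufferedWriter out = new BufferedWriter(new FileWriter(output))) {
                while (!heap.isEmpty()) {
                    Source s = heap.poll();
                    out.write(s.current);                  // process/emit records in sorted order here
                    out.newLine();
                    if (s.advance()) heap.add(s); else s.reader.close();
                }
            }
        }

        static class Source implements Comparable<Source> {
            final BufferedReader reader;
            String current;
            Source(BufferedReader reader) { this.reader = reader; }
            boolean advance() throws IOException { current = reader.readLine(); return current != null; }
            public int compareTo(Source o) { return current.compareTo(o.current); }
        }

        public static void main(String[] args) throws IOException {
            List<File> chunks = splitAndSort(new File("dump.txt"), 1000000); // ~1/30th of the data per chunk
            merge(chunks, new File("sorted.txt"));
        }
    }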
For the best performance, definitely option 1. Dumping the data to a flat file, sorting it with a good external sort program, and then reading it back in will use the minimum amount of resources of all the options. If you want to post specifics on the record length and system configuration (memory, disk speeds), I can let you know how long it should take.
The problem with option 2 is that it may simply reproduce the problem you currently have in another form. I can't tell from your post how complex your query is (how many tables you're joining), and it may be that a lot of your 40 minutes is being spent in the join. But even if that is the case, option 2 still has to do an external sort if your data is 15 times the size of available memory. The only databases that do this well are those that are designed to use a commercial external sort under the covers, so you're back to option 1 anyway.
As far as elegance is concerned, that's often in the eye of the beholder ;-). Personally, I find ultra-high performance elegant in its own right, but it's kinda subjective.
It's hard to say which method will be better for you; you really have to benchmark it.
A good idea is to increase your memory and keep an ordered index there, then retrieve the data from disk/database based on the index of the item that you need.

MongoDB EC2 Benchmark Configuration

I am planning to use MongoDB on EC2 for my web application. Right now I am getting 8000 reads/writes on MongoDB. My MongoDB instance type is m1.large.
For optimum performance I have followed these sites:
site-1 site-2
I have tried a lot but failed to achieve the performance mentioned on those sites. I want to know whether there is any other resource where I can find optimum EC2 performance benchmarks and some sort of configuration.
Here are some things you can do to increase write performance on EC2:
RAID
RAID 0 will get you better speed, but make sure your application doesn't require the mirroring provided by RAID 10 (the recommended setup). RAID 10 is mirroring (RAID 1) and striping (RAID 0) together.
Note: there's a 2 Gb/s rate limit between an individual EC2 instance and the EBS service as a whole, so this is the maximum rate you could ever possibly hope to write to EBS from a single node, no matter how many volumes you have.
Use sharding
The more the writes are spread around to different boxes, the less write load each one has.
Index only the important stuff
Try to limit/reduce the number of indexes you have. Each additional index means an insert or update incurs additional writes (and therefore worse write performance).
Control document size
Try to control document size so that documents don't grow and therefore have to be moved to the end of the collection very often. If you have an array you are pushing to, try to fix (limit) the number of items and their individual size so that Mongo can be smart about the padding factor and help you out here; see the sketch below.
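As an illustration, a minimal sketch of the capped-array idea with the (older 2.x-style) Java driver, assuming MongoDB 2.4+ where $push supports $each/$slice; the database, collection, and field names are placeholders:

    import com.mongodb.BasicDBObject;
    import com.mongodb.DB;
    import com.mongodb.DBCollection;
    import com.mongodb.MongoClient;

    import java.util.Arrays;

    public class CappedArrayPush {
        public static void main(String[] args) throws Exception {
            MongoClient mongo = new MongoClient("localhost", 27017);
            DB db = mongo.getDB("mydb");                    // hypothetical database/collection names
            DBCollection events = db.getCollection("events");

            BasicDBObject newItem = new BasicDBObject("msg", "hello")
                    .append("ts", System.currentTimeMillis());

            // $push with $each + $slice keeps only the last 100 array items,
            // so the document's size stays bounded and it is relocated less often.
            BasicDBObject push = new BasicDBObject("$push",
                    new BasicDBObject("log",
                            new BasicDBObject("$each", Arrays.asList(newItem))
                                    .append("$slice", -100)));

            events.update(new BasicDBObject("_id", "user-1"), push, true, false); // upsert, not multi

            mongo.close();
        }
    }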

Instant searching in a petabyte of data

I need to search over a petabyte of data in CSV-format files. After indexing with Lucene, the index is double the size of the original files. Is it possible to reduce the index size? How can I distribute Lucene index files across Hadoop, and how do I use them in a search environment? Or is that even necessary - should I use Solr to distribute the Lucene index? My requirement is instant search over a petabyte of files.
Hadoop and MapReduce are based on batch-processing models. You're not going to get instant response speed out of them; that's just not what the tool is designed to do. You might be able to speed up your indexing with Hadoop, but it isn't going to do what you want for querying.
Take a look at Lucandra, which is a Cassandra-based back end for Lucene. Cassandra is another distributed data store, developed at Facebook if I recall, designed for faster access times in a more query-oriented access model than Hadoop.
Any decent off the shelf search engine (like Lucene) should be able to provide search functionality over the size of data you have. You may have to do a bit of work up front to design the indexes and configure how the search works, but this is just config.
You won't get instant results but you might be able to get very quick results. The speed will probably depend on how you set it up and what kind of hardware you run on.
You mention that the indexes are larger than the original data. This is to be expected; indexing usually includes some form of denormalisation. The size of the indexes is often a trade-off with speed: the more ways you slice and dice the data in advance, the quicker it is to find references.
Lastly you mention distributing the indexes, this is almost certainly not something you want to do. The practicalities of distributing many petabytes of data are pretty daunting. What you probably want is to have the indexes sat on a big fat computer somewhere and provide search services on the data (bring the query to the data, don't take the data to the query).
If you want to avoid changing your implementation, you could decompose your Lucene index into 10, 20 or even more indices and query them in parallel. It worked in my case: I created 8 indices for 80 GB of data, and I needed to implement a search that works on a developer machine (Intel Core Duo, 3 GB RAM).
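A minimal sketch of that idea with the plain Lucene API (4.x-style; the shard count, directory names, and field names are placeholders): open one reader per sub-index, wrap them in a MultiReader, and give the IndexSearcher an executor so the sub-indexes are searched in parallel.

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.FSDirectory;

    import java.io.File;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ShardedSearch {
        public static void main(String[] args) throws Exception {
            int shards = 8;
            IndexReader[] readers = new IndexReader[shards];
            for (int i = 0; i < shards; i++) {
                readers[i] = DirectoryReader.open(FSDirectory.open(new File("index-" + i))); // one directory per shard
            }

            // A MultiReader presents the shards as one logical index; the executor lets
            // the IndexSearcher fan the query out over the index segments in parallel.
            ExecutorService pool = Executors.newFixedThreadPool(shards);
            IndexSearcher searcher = new IndexSearcher(new MultiReader(readers), pool);

            Query query = new TermQuery(new Term("content", "hadoop")); // placeholder query
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("path"));
            }

            pool.shutdown();
            for (IndexReader r : readers) r.close();
        }
    }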
