When to do indexing in lucene

When to do indexing in lucene - java

I have REST service which works with data from database (mongodb). I want to add apache lucene library to implement full text search.
I never used Lucene before so trying to understand how it works be checking tutorials, but still one thing is unclear for me:
When to do indexing of DB data? I have DB, some data is added and removed more often, some is updated rarely. What should be structure that I could do search requests by all up to date data.
Should I update indexes on every data update, or it will be done automatically, and enough to index once? If reindexing should be made, so how often?

If you want live data to be searched then you should add, update and delete data in lucene index at the same time you perform add, update and delete data in your database.
It will perfectly fine just for indexing but do not optimize your index for every operation.
You can optimize your index once in a day or according to your use. Optimizing index will help you for faster search result.
Refer this tutorial to just begin with basic application of lucene.

You can try MongoDBs own Feature for this (see Mongo Docs). This has probably not the flexibility and is not as mighty as Lucene, but it Comes for free.
You really asked the problematic question: "When do indexing?". And the answer depends heavy on your requirements. However, you can look at this post to see how it is technically done: offline, i.e. you will always be more or less behind in indexing.

Related

Solr search query time increases as the start keeps on increasing

I have currently over 25 million documents in Solr and the volume will gradually increase. I need to search for the records over such big size of Solr indexes. The query response time is low when the start is low, e.g 0. But as the start increases, e.g 100000 , searching in Solr is also taking time. How can i make the search faster even with high start number over large data sets in Solr? The rows remain constant only the start keeps on increasing. I don't want the response time to increase as the start keeps on increasing instead want the result returned for start=100000 should take the same time as for start=0 with say suppose rows=1000 as this is performance issue. Any help would be appreciated.

The problem you are facing is called Deep Paging. There is a good article about it on solr.pl and an incomplete issue on Solr's tracker.
The solution mentioned in the article will require you to sort your result, if that is not feasible for you the solution will not work. The idea is to sort by a stable attribute, in the article that is price and then filter with a price range, like fq=price:[9000+TO+10000].
If you combine that fq with a suitable start - like start=100030 - you will get better performance, as solr will not collect the documents that do not match the fq.
But you will need to make at least one query in advance to fetch the suitable meta data, like how many docs have been found at all.

With the release of Solr 4.7 a new feature has been introduced Cursors. This has been done exactly to address the problem of Deep Paging. If you still have the problem and you may perform the upgrade to Solr 4.7 this is the best option for you.
Some references about deep paging with Solr
https://lucene.apache.org/solr/guide/7_7/pagination-of-results.html#performance-problems-with-deep-paging

Result Set to Multi Hash Map

I have a situation here. I have a huge database with >10 columns and millions of rows. I am using a matching algorithm which matches each input records with the values in database.
The database operation is taking lot of time when there are millions of records to match. I am thinking of using a multi-hash map or any resultset alternative so that i can save the whole table in memory and prevent hitting database again....
Can anybody tell me what should i do??

I don't think this is the right way to go. You are trying to do the database's work manually in Java. I'm not saying that you are not capable of doing this, but most databases have been developed for many years and are quite good in doing exactly the thing that you want.
However, databases need to be configured correctly for a given type of query to be executed fast. So my suggestion is that you first check whether you can tweak the database configuration to improve the performance of the query. The most common thing is to add the right indexes to your table. Read How MySQL Uses Indexes or the corresponding part of the manual of your particular database for more information.
The other thing is, if you have so much data storing everything in main memory is probably not faster and might even be infeasible. Not to say that you have to transfer the whole data first.
In any case, try to use a profiler to identify the bottleneck of the program first. Maybe the problem is not even on the database side.

Best way to sort the data : DB Query or in Application Code

I have a Mysql table with some data (> million rows). I have a requirement to sort the data based on the below criteria
1) Newest
2) Oldest
3) top rated
4) least rated
What is the recommended solution to develop the sort functionality
1) For every sort reuest execute a DBQuery with required joins and orderBy conditions and return the sorted data
2) Get all the data (un sorted) from table, put the data in cache. Write custom comparators (java) to sort the data.
I am leaning towards #2 as the load on DB is only once. Moreover, application code is better than DBQuery.
Please share your thoughts....
Thanks,
Karthik

Do as much in the database as you can. Note that if you have 1,000,000 rows, returning all million is nearly useless. Are you going to display this on a web site? I think not. Do you really care about the 500,000th least popular post? Again, I think not.
So do the sorts in the database and return the top 100, 500, or 1000 rows.

It's much faster to do it in the database:
1) the database is optimized for I/O operations, and can use indices, and other DB optimizations to improve the response time
2) taking the data from the database to the application will get all data into memory. The app will have to look all the data to redorder it without optimized algorithms
3) the database only takes the minimun necessary data into mamemory, which can be much less than all the data whihc has to be moved to java
4) you can always create extra indices on the database to improve the query performance.

I would say that operation on DB will be always faster. You should ensure that caching on DB is ON and working properly. Ensure that you are not using now() in your query because it will disable mysql cache. Take a look here how mysql query cache works. In basic. Query is cached based on string so if query string differs every time you fetch no cache is used.

AFAIK usually it should run faster if you let the DB sort your data.
And regarding code on application level vs db level I would agree in the case of stored procedures but sorting in SELECTs is fine IMHO.
If you want to show the data to the user also consider paging (in which case you're better off with sorting on the db level anyway).

Fetching a million rows from the database sounds like a terrible idea. It will generate a lot of networking traffic and require quite some time to transfer all the data. Not mentioning amounts of memory you would need to allocate in your application for storing million of objects.
So if you can fetch only a subset with a query, do that. Overall, do as much filtering as you can in the database.
And I do not see any problem in ordering in a single queue. You can always use UNION if you can't do it as one SELECT.

You do not have four tasks, you have two:
sort newest IS EQUAL TO sort oldest
AND
sort top rated IS EQUAL TO sort least rated.
So you need to make two calls to db. Yes sort in db. then instead of calling to sort every time, do this:
1] track the timestamp of the latest record in the db
2] before calling to sort and retrieve entire list, check if date has changed
3] if date has not changed, use the list you have in memory
4] if date has changed, update the list

I know this is an old thread, but it comes up in my search, so I'd like to post my opinion.
I'm a bit old school, but for that many rows, I would consider dumping the data from your database (each RDBMS has it's own method. Looks like MySQLDump command for MySQL: Link )
You can then process this with sorting algorithms or tools that are available in your java libraries or operating system.
Be careful about the work your asking your database to do. Remember that it has to be available to service other requests. Don't "bring it to it's knees" servicing only one request, unless it's a nightly batch cycle type of scenario and you're certain it won't be asked to do anything else.

java - MongoDB + Solr performances

I've been looking around a lot to see how to use MongoDB in combination with Solr, and some questions here have partial responses, but nothing really concrete (more like theories). In my application, I will have lots and lots of documents stored in MongoDB (maybe up to few hundred millions), and I want to implement full-text searches on some properties of those documents, so I guess Solr is the best way to do this.
What I want to know is how should I configure/execute everything so that it has good performances? right now, here's what I do (and I know its not optimal):
1- When inserting an object in MongoDB, I then add it to Solr
SolrServer server = getServer();
SolrInputDocument document = new SolrInputDocument();
document.addField("id", documentId);
...
server.add(document);
server.commit();
2- When updating a property of the object, since Solr cannot update just one field, first I retrieve the object from MongoDB then I update the Solr index with all properties from object and new ones and do something like
StreamingUpdateSolrServer update = new StreamingUpdateSolrServer(url, 1, 0);
SolrInputDocument document = new SolrInputDocument();
document.addField("id", documentId);
...
update.add(document);
update.commit();
3- When querying, first I query Solr and then when retrieving the list of documents SolrDocumentList I go through each document and:
get the id of the document
get the object from MongoDB having the same id to be able to retrieve the properties from there
4- When deleting, well I haven't done that part yet and not really sure how to do it in Java
So anybody has suggestions on how to do this in more efficient ways for each of the scenarios described here? like the process to do it in a way that it won't take 1hour to rebuild the index when having a lot of documents in Solr and adding one document at a time? my requirements here are that users may want to add one document at a time, many times and I'd like them to be able to retrieve it right after

Your approach is actually good. Some popular frameworks like Compass are performing what you describe at a lower level in order to automatically mirror to the index changes that have been performed via the ORM framework (see http://www.compass-project.org/overview.html).
In addition to what you describe, I would also regularly re-index all the data which lives in MongoDB in order to ensure both Solr and Mongo are sync'd (probably not as long as you might think, depending on the number of document, the number of fields, the number of tokens per field and the performance of the analyzers : I often create index from 5 to 8 millions documents (around 20 fields, but text fields are short) in less than 15 minutes with complex analyzers, just ensure your RAM buffer is not too small and do not commit/optimize until all documents have been added).
Regarding performance, a commit is costly and an optimize is very costly. Depending on what matters the most to you, you could change the value of mergefactor in Solrconfig.xml (high values improve write performance whereas low values improve read performance, 10 is a good value to start with).
You seem to be afraid of the index build time. However, since Lucene indexes storage is segment-based, the write throughput should not depend too much on the size of the index (http://lucene.apache.org/java/2_3_2/fileformats.html). However, the warm-up time will increase, so you should ensure that
there are typical (especially for sorts in order to load the fieldcaches) but not too complex queries in the firstSearcher and newSearcher parameters in your solrconfig.xml config file,
useColdSearcher is set to
false in order to have good search performance, or
true if you want changes performed to the index to be taken faster into account at the price of a slower search.
Moreover, if it is acceptable for you if the data becomes searchable only a few X milliseconds after it has been written to MongoDB, you could use the commitWithin feature of UpdateHandler. This way Solr will have to commit less often.
For more information about Solr performance factors, see
http://wiki.apache.org/solr/SolrPerformanceFactors
To delete documents, you can either delete by document ID (as defined in schema.xml) or by query :
http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/SolrServer.html

You can also wait for more documents and indexing them only each X minutes. (Of course this highly depend of your application & requirements)
If your documents are small and you don't need all data (which are stored in MongoDB) you can put only the field you need in the Solr Document by storing them but not indexing
<field name="nameoyourfield" type="stringOrAnyTypeYouuse"indexed="false"stored="true"/>

Strategy for locale sensitive sort with pagination

I work on an application that is deployed on the web. Part of the app is search functions where the result is presented in a sorted list. The application targets users in several countries using different locales (= sorting rules). I need to find a solution for sorting correctly for all users.
I currently sort with ORDER BY in my SQL query, so the sorting is done according to the locale (or LC_LOCATE) set for the database. These rules are incorrect for those users with a locale different than the one set for the database.
Also, to further complicate the issue, I use pagination in the application, so when I query the database I ask for rows 1 - 15, 16 - 30, etc. depending on the page I need. However, since the sorting is wrong, each page contains entries that are incorrectly sorted. In a worst case scenario, the entire result set for a given page could be out of order, depending on the locale/sorting rules of the current user.
If I were to sort in (server side) code, I need to retrieve all rows from the database and then sort. This results in a tremendous performance hit given the amount of data. Thus I would like to avoid this.
Does anyone have a strategy (or even technical solution) for attacking this problem that will result in correctly sorted lists without having to take the performance hit of loading all data?
Tech details: The database is PostgreSQL 8.3, the application an EJB3 app using EJB QL for data query, running on JBoss 4.5.

Are you willing to develop a small Postgres custom function module in C? (Probably only a few days for an experienced C coder.)
strxfrm() is the function that transforms the language-dependent text string based on the current LC_COLLATE setting (more or less the current language) into a transformed string that results in proper collation order in that language if sorted as a binary byte sequence (e.g. strcmp()).
If you implement this for Postgres, say it takes a string and a collation order, then you will be able to order by strxfrm(textfield, collation_order). I think you can then even create multiple functional indexes on your text column (say one per language) using that function to store the results of the strxfrm() so that the optimizer will use the index.
Alternatively, you could join the Postgres developers in implementing this in mainstream Postgres. Here are the wiki pages about this issues: Collation, ICU (which is also used by Java as far as I know).
Alternatively, as a less sophisticated solution if data input is only through Java, you could compute these strxfrm() values in Java (Java will probably have a different name for this concept) when you add the data to the database, and then let Postgres index and order by these precomputed values.

How tied are you to PostgreSQL? The documentation isn't promising:
The nature of some locale categories is that their value has to be fixed for the lifetime of a database cluster. That is, once initdb has run, you cannot change them anymore. LC_COLLATE and LC_CTYPE are those categories. They affect the sort order of indexes, so they must be kept fixed, or indexes on text columns will become corrupt. PostgreSQL enforces this by recording the values of LC_COLLATE and LC_CTYPE that are seen by initdb. The server automatically adopts those two values when it is started.
(Collation rules define how text is sorted.)
Google throws up patch under discussion:
PostgreSQL currently only supports one collation at a time, as fixed by the LC_COLLATE variable at the time the database cluster is initialised.
I'm not sure I'd want to manage this outside the database, though I'd be interested in reading about how it can be done. (Anyone wanting a good technical overview of the issues should check out Sorting Your Linguistic Data inside the Oracle Database on the Oracle globalization site.)

I don't know any way to switch the database order by order. Therefore, one has to consider other solutions.
If the number of results is really big (hundred thousands ?), I have no solutions, except showing only the number of results, and asking the user to make a more precise request. Otherwise, the server-side could do, depending on the precise conditions....
Especially, using a cache could improve things tremendously. The first request to the database (unlimited) would not be so much slower than for a query limited in number of results. And the subsequent requests would be much faster. Often, paging and re-sorting makes for several requests, so the cache would work well (even with a few minutes duration).
I use EhCache as a technical solution.
Sorting and paging go together, sorting then paging.
The raw results could be memorized in the cache.
To reduce the performance hit, some hints:
you can run the query once for result set size, and warn the user if there are too many results (ask either for confirming a slow query, or add some selection fields)
only request the columns you need, let go all other columns (usually some data is not shown immediately for all results, but displayed on mouse move for example ; this data can be requested lazyly, only as needed, therefore reducing the columns requested for all results)
if you have computed values, cache the smaller between the database columns and the computed values
if you have repeated values in multiple results, you can request that data/columns separately (so you retrieve from the database once, and cache them only once), retrieve only a key (typically, and id) in the main request.

You might want to checkout this packge: http://www.fi.muni.cz/~adelton/l10n/postgresql-nls-string/. It hasn't been updated in a long time, and may not work anymore, but it seems like a reasonable startingpoint if you want to build a function that can do this for you.

This module is broken for Postgres 8.4.3. I fixed it - you can download fixed version from http://www.itreport.eu/__cw_files/.01/.17/.ee7844ba6716aa36b19abbd582a31701/nls_string.c and you'll have to compile and install it by hands (as described at related README and INSTALL from original module) but anyway sorting is working incorrectly. I tried it on FreeBSD 8.0, LC_COLLATE is cs_CZ.UTF-8

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.