I need to query a field in two ways: once with the norm taken into account, and once without the norm affecting the score.
What I have done is index the field twice under two different names, as follows:
"field" with Field.omitNorms(false);
"field_noNorm" with Field.omitNorms(true);
This solution achieved my objective, but it has doubled the size of the index. Now that the index size is becoming critical, and is also affecting query time when searching on the field without norms, I need to find a smarter solution.
Is it possible to store a single field with norms and multiply the score by the inverse of the norm at query time to remove its effect on the final score?
And, if so, what is the fastest way to retrieve the norm of a document at search time?
I solved the problem by reading the norm at index time with:
searcher.getIndexReader().norms(field);
I have been assigned to write algorithms for filtering/searching.
Task: Filter: search and list objects that match the specified attribute(s)
Say the whole system is a student registration record system.
I have data as shown below. I will need to filter and search by these attributes, say search/filter by gender, student name, date of birth, etc.
Student Name, Gender, Date Of Birth, Mobile No
Is there a specific efficient algorithm or method for each of these fields?
For example, strings and integers each have their own type of efficient search algorithm, right?
Here's what I am going to do.
I am going to code a binary search algorithm for searching/filtering based on these fields above.
That's it. But yeah that's easy to be honest.
But I am just curious: what is the proper coding approach for an efficient search/filter algorithm for each of these fields? What would you guys do?
Obviously I will not be using a sequential search, as this will involve a huge amount of data and iterating over every record would hurt performance.
A sequential search will be used only where needed, when the data set is small.
Searching is a very broad topic, and the right approach depends entirely on your use case.
While building an efficient search algorithm, you should take the factors below into consideration:
What's the size of your data? Is it fixed, or does it vary periodically?
How often are you going to insert/modify/delete your data?
Is your data sorted or unsorted?
Do you need a prefix-based search like autosearch, autocomplete, longest prefix search, etc.?
Now let's think about the solution/approach:
If your data set is small and unsorted, you can try linear search, which has O(n) time complexity, where "n" is the size of your data/array.
If your data is already sorted, which is not always the case, you can use binary search, as its complexity is O(log n). If your data is not sorted, sorting it first takes O(n log n); in Java, Arrays.sort() uses a dual-pivot quicksort for primitives and a merge sort variant for objects, both typically O(n log n).
If fast retrieval is the main objective, you can consider a HashMap or Hashtable. The elements are indexed by hash code, so the time to look up any element is close to O(1), i.e. constant time (if your hash function implementation is good).
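As a rough illustration of the sorted and hashed options above, here is a minimal Java sketch. The class name StudentLookup, the record layout (name, mobile number), and the sample values are made up for the example and are not taken from the question.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; names and record layout are assumptions.
class StudentLookup {
    // Binary search over a sorted list of names: O(log n) per lookup.
    // Returns the index of the name, or -1 if it is absent.
    static int findSorted(List<String> sortedNames, String name) {
        int idx = Collections.binarySearch(sortedNames, name);
        return idx >= 0 ? idx : -1;
    }

    // Hash index from mobile number to student name: close to O(1) per lookup.
    static Map<String, String> indexByMobile(String[][] records) {
        Map<String, String> index = new HashMap<>();
        for (String[] r : records) {
            index.put(r[1], r[0]); // r[0] = name, r[1] = mobile number
        }
        return index;
    }

    public static void main(String[] args) {
        String[][] records = {
            {"Chen", "555-0101"}, {"Asha", "555-0102"}, {"Bob", "555-0103"}
        };
        List<String> names = new ArrayList<>();
        for (String[] r : records) {
            names.add(r[0]);
        }
        Collections.sort(names); // O(n log n), done once before searching
        System.out.println(findSorted(names, "Bob"));
        System.out.println(indexByMobile(records).get("555-0102"));
    }
}
```

Binary search suits fields you can keep sorted (names, dates of birth), while a hash index suits exact-match lookups on a unique key such as the mobile number.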
Prefix-based search: since you mentioned searching by name, you also have the option of using the trie data structure.
Tries are an excellent option if you insert, delete, or update entries frequently.
Lookup of an element in a trie is O(k), where "k" is the length of the string being searched for.
Since you have registration data where inserts, updates, and deletions are common, a trie is a good option to consider.
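To make the trie idea concrete, here is a minimal sketch in Java. The class NameTrie and its method names are invented for this example; it supports only insert, exact lookup, and prefix listing, and leaves out deletion.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical minimal trie; not a production implementation.
class NameTrie {
    private static final class Node {
        // TreeMap keeps children ordered, so prefix results come out sorted.
        final Map<Character, Node> children = new TreeMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    // Walks or creates one node per character of the name.
    void insert(String name) {
        Node node = root;
        for (char c : name.toCharArray()) {
            node = node.children.computeIfAbsent(c, x -> new Node());
        }
        node.isWord = true;
    }

    // Exact lookup: one child lookup per character.
    boolean contains(String name) {
        Node node = walk(name);
        return node != null && node.isWord;
    }

    // Returns all stored names that start with the given prefix.
    List<String> withPrefix(String prefix) {
        List<String> out = new ArrayList<>();
        Node node = walk(prefix);
        if (node != null) {
            collect(node, new StringBuilder(prefix), out);
        }
        return out;
    }

    // Follows the characters of s from the root; null if the path is missing.
    private Node walk(String s) {
        Node node = root;
        for (char c : s.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return null;
            }
        }
        return node;
    }

    // Depth-first traversal that appends every complete name below node.
    private void collect(Node node, StringBuilder path, List<String> out) {
        if (node.isWord) {
            out.add(path.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            path.append(e.getKey());
            collect(e.getValue(), path, out);
            path.deleteCharAt(path.length() - 1);
        }
    }
}
```

Calling withPrefix("ann") after inserting "anna", "anne", and "bob" returns the two names that start with "ann", which is the autocomplete-style lookup described above.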
Also, check this link to choose between tries and hash tables: TriesVsMaps
Below is a sample representation of a trie (image source: Hackerearth)
I need to find the number (or list) of unique terms over a combination of two or more fields in Lucene-Java. I am using the Java libraries for Lucene 4.1.0. I checked questions such as this and this, but they discuss finding the list of unique terms from a single (specific) field, or over all the fields (no subset).
For example, I am interested in number(unique(height, gender)) rather than number(unique(height)), or number(unique(gender)).
Given the data:
height,gender
1,M
2,F
3,M
3,F
4,M
4,F
number(unique(height)) is 4, number(unique(gender)) is 2 and number(unique(gender,height)) is 6.
Any help will be greatly appreciated.
Thanks!
If you have a predefined set of fields, the simplest and quickest approach (in search terms) would be to index a combined field, i.e. heightGender (1.23:male). You can then just count the unique terms in this field. However, this doesn't offer any flexibility at search time.
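The combined-field idea can be sketched outside Lucene as follows. This is plain Java that only illustrates how the combined term is formed and counted; the class CombinedTerms and the ":" separator are assumptions, and in a real index the combined string would be written as its own field at indexing time.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of counting unique combined terms.
class CombinedTerms {
    // Builds the combined term for one document, e.g. "3:M".
    static String combine(String height, String gender) {
        return height + ":" + gender;
    }

    // Counts distinct (height, gender) combinations across all rows.
    static int countUnique(String[][] rows) {
        Set<String> seen = new HashSet<>();
        for (String[] row : rows) {
            seen.add(combine(row[0], row[1]));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // The sample data from the question.
        String[][] rows = {
            {"1", "M"}, {"2", "F"}, {"3", "M"}, {"3", "F"}, {"4", "M"}, {"4", "F"}
        };
        System.out.println(countUnique(rows)); // prints 6
    }
}
```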
A more flexible approach would be to use facets (https://lucene.apache.org/core/4_1_0/facet/index.html). You would then constrain your query to each value of one field (e.g. Gender (male/female)) and retrieve all the values (and document counts) of the other field.
However if you do not have the ability to change the indexing process then you are left with doing a brute force search using Boolean queries to find the number of documents in the index for all combinations of the field values in which you are interested. I presume you are only counting combinations where the number of documents is non-zero.
It is worth noting that this question is exactly what Solr Pivot Facets address (http://lucidworks.com/blog/pivot-facets-inside-and-out/)
I use Solr (SolrCloud) to index and search my tweets. There are about 16 million tweets and the index size is approximately 3 GB. The tweets are indexed in real time as they come so that real time search is enabled. Currently I use lowercase field type for my tweet body field. For a single search term in the search, it is taking around 7 seconds and with addition of each search term, time taken for search is linearly increasing. 3GB is the maximum RAM allocated for the solr process. Sample solr search query looks like this
tweet_body:*big* AND tweet_body:*data* AND tweet_tag:big_data
Any suggestions on improving the speed of searching? Currently I run only 1 shard which contains the entire tweet collection.
The query tweet_body:*big* can be expected to perform poorly. Trailing wildcards are easy, and leading wildcards can be readily handled with a ReversedWildcardFilterFactory. A term with both, however, will have to scan every document, rather than being able to use the index to locate matching documents. Combining the two approaches would only allow you to search:
tweet_body:*big tweet_body:big*
Which is not the same thing. If you really must search for terms with a leading AND trailing wildcard, I would recommend looking into indexing your data as N-grams.
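To show what indexing as N-grams means, here is a small Java sketch that splits a token into character n-grams. The class and method names are invented for illustration; in Solr this would normally be done by an n-gram filter in the field's analyzer rather than by hand.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of character n-gram generation.
class CharNGrams {
    // Returns every substring of length n, in order of position.
    static List<String> ngrams(String token, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= token.length(); i++) {
            grams.add(token.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // "bigdata" yields the grams big, igd, gda, dat, ata.
        System.out.println(ngrams("bigdata", 3));
    }
}
```

Because "big" is stored as its own term, a search for text containing "big" becomes an exact term lookup instead of a scan with wildcards on both ends, at the cost of a larger index.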
I wasn't previously aware of it, but it seems the lowercase field type is a Lowercase filtered KeywordAnalyzer. This is not what you want. That means the entire field is treated as a single token. Good for identification numbers and the like, but not for a body of text you wish to perform a full text search on.
So yes, you need to change it. text_general is probably appropriate. That will index a correctly tokenized field, and you should be able to perform the query you are looking for with:
tweet_body:big AND tweet_body:data AND tweet_tag:big_data
You will have to reindex, but there is no avoiding that. There is no good, performant way to perform a full text search on a keyword field.
Try using filter queries, as filter queries run in parallel.
I'm building a system where I want to show only results indexed in the past few days.
Furthermore, I don't want to maintain a giant index with a million documents if I only want to return results from a couple of days (thousands of documents).
On the other hand, my system relies heavily on the occurrences of terms in the indexed documents having a realistic distribution (and consequently a realistic IDF).
That said, I would like to use a small index to return results, but compute document scores using an IDF from a much larger index (or even an external source).
The Similarity API doesn't seem to allow me to do this. The idf method does not receive the term being used as a parameter.
Another possibility is to use TrieRangeQuery to make sure the documents shown are within the last couple of days. Again, I'd rather not maintain a larger index. Also, this kind of query is not cheap.
You should be able to extend IndexReader and override the docFreq() methods to provide whatever values you'd like. One thing this implementation can do is open two IndexReader instances -- one for the small index and one for the large index. All the methods are delegated to the small IndexReader, except for docFreq(), which is delegated to the large index. You'll need to scale the value returned, i.e.
int myNewDocFreq = (int) ((long) bigIndexReader.docFreq(t) * smallIndexReader.maxDoc() / bigIndexReader.maxDoc());
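The scaling step can be sketched as a standalone helper with plain integers in place of the reader calls. The class ScaledDocFreq and the sample numbers are assumptions for illustration. It multiplies before dividing, in long arithmetic, because dividing two ints first would truncate the ratio to zero whenever the document frequency is smaller than the size of the big index.

```java
// Hypothetical helper; the arguments stand in for docFreq() and maxDoc() values.
class ScaledDocFreq {
    // Scales a document frequency from the big index down to the small one.
    static int scale(int bigDocFreq, int bigMaxDoc, int smallMaxDoc) {
        if (bigMaxDoc == 0) {
            return 0; // empty big index: nothing to scale from
        }
        return (int) ((long) bigDocFreq * smallMaxDoc / bigMaxDoc);
    }

    public static void main(String[] args) {
        // A term in 50,000 of 1,000,000 documents, scaled to a 2,000-document index.
        System.out.println(scale(50_000, 1_000_000, 2_000)); // prints 100
    }
}
```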
Is it possible to adjust the boost of a field with the Query object before running the search?
I know the proper way to do it is to change the field's boost during indexing, but it takes about 4 days to build an index, and I was just wondering if there's a quick hack I can do for now.
I have also tried hardcoding the boost into the search query, i.e.
AND field(this that other)^7
and that works, and it would be the end of it, EXCEPT I want to reduce the relevance of this part of the query.
I want
AND field(this that other)^.1
but I get empty results.
Thanks
You can extend Similarity and in the Searcher, use
setSimilarity(Similarity)
By extending Similarity, you can adapt the scoring mechanism in Lucene to your needs.
EDIT:
More specifically, you can override the lengthNorm method in Similarity (or a subclass thereof):
@Override
public float lengthNorm(String fieldName, int numTokens) {
    // Scale the default length norm by the weight configured for this field.
    return fieldWeights.get(fieldName) * super.lengthNorm(fieldName, numTokens);
}
fieldWeights could be a Map attribute in which you specify the weight you want to attach to each field. If you keep a reference to fieldWeights somewhere, you can change the field weights to whatever you want just before you perform a search (But do this for only one query at a time, to experiment).
Why not just boost the terms you want with a higher value and leave the ones you are trying to un-boost at zero?