Lucene: set boost on fields at search time - Java

Is it possible to adjust the boost of a field with the Query object before running the search?
I know the proper way to do it is to change the field's boost during indexing, but building the index takes about four days, and I was wondering if there's a quick hack I can do for now.
I have also tried hardcoding the boost into the search query, i.e.
AND field:(this that other)^7
and that works, and that would be the end of it, EXCEPT I want to reduce the relevance of this part of the query. I want
AND field:(this that other)^.1
but I get empty results.
Thanks

You can extend Similarity and, on the Searcher, use
setSimilarity(Similarity)
By extending Similarity, you can adapt Lucene's scoring mechanism to your needs.
EDIT:
More specifically, you can override the lengthNorm method in Similarity (or a subclass thereof):
public float lengthNorm(String fieldName, int numTokens) {
    return fieldWeights.get(fieldName) * super.lengthNorm(fieldName, numTokens);
}
fieldWeights could be a Map attribute in which you specify the weight you want to attach to each field. If you keep a reference to fieldWeights somewhere, you can change the field weights to whatever you want just before you perform a search (but do this for only one query at a time while you experiment); see the sketch below.
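A minimal sketch of that idea, assuming Lucene 3.x (where Similarity.lengthNorm(String, int) and Searcher.setSimilarity(Similarity) exist); WeightedSimilarity and fieldWeights are illustrative names, not Lucene APIs:

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.search.DefaultSimilarity;

public class WeightedSimilarity extends DefaultSimilarity {
    // field name -> multiplicative weight; mutate this map before a search to retune
    private final Map<String, Float> fieldWeights = new HashMap<String, Float>();

    public void setFieldWeight(String field, float weight) {
        fieldWeights.put(field, weight);
    }

    @Override
    public float lengthNorm(String fieldName, int numTokens) {
        Float w = fieldWeights.get(fieldName);
        // fall back to the default norm when no weight is configured
        return (w == null ? 1f : w) * super.lengthNorm(fieldName, numTokens);
    }
}

You would then install it with searcher.setSimilarity(new WeightedSimilarity()). Since lengthNorm is also consulted when norms are written at index time, it is worth verifying the search-time effect experimentally.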

Why not just boost the terms you want with a higher value and leave the ones you are trying to un-boost at zero? (A clause with a boost of 0 still has to match, but contributes nothing to the score.)
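For example, a sketch using the programmatic query API (field names and values are illustrative): raising the clause that matters gives the same relative ordering as lowering the weak one.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Promote the clause that should dominate instead of demoting the weak one.
BooleanQuery query = new BooleanQuery();
Query important = new TermQuery(new Term("title", "foo"));
important.setBoost(7f);                                  // raise what matters
Query weak = new TermQuery(new Term("field", "that"));   // left at the default boost
query.add(important, BooleanClause.Occur.MUST);
query.add(weak, BooleanClause.Occur.MUST);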

Related

How can I get the size of Solr Facet results?

There is a multi-valued field in my schema named XXX, and there may be more than 100,000 documents in my Solr index. I want to find out how many distinct values exist in XXX, without any duplication.
For now, I use facet.field=XXX&facet.limit=-1 to get the size of the facet results. It takes a lot of time and sometimes hits a read timeout.
All I want from the facet results is the 'size'; I don't care about the contents.
By the way, I use Solr 5.0. Is there any better solution for my requirement?
The index does maintain a list of unique terms, since that is how an inverted index works, and that list is very fast to compute and return, unlike faceting. If your values are single terms, this could be a way to get what you want. You can retrieve the unique terms with the TermsComponent, provided it is enabled in your solrconfig.xml. For example:
http://localhost:8983/solr/corename/terms?q=*%3A*&wt=json&indent=true&terms=true&terms.fl=XXX
would return a list of all unique terms and their counts:
{
  "responseHeader": {
    "status": 0,
    "QTime": 0
  },
  "terms": {
    "XXX": [
      "John Backus", 3,
      "Ada Lovelace", 3,
      "Charles Babbage", 2,
      "John Mauchly", 1,
      "Alan Turing", 1
    ]
  }
}
The length of this list is the number of unique terms; in the example, that would be 5. Unfortunately, the API doesn't provide a way to ask only for the count without returning the terms themselves, so while it is fast to generate the list, the time required to return the full list gives it a drawback similar to the facets approach. The returned list may also become quite long.
Check out https://wiki.apache.org/solr/TermsComponent for the API details.
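If you are calling Solr from Java, here is a hedged SolrJ sketch of the same request (assuming SolrJ 5.x and that the /terms handler is enabled; the core name and field XXX are placeholders from the question):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/corename");
SolrQuery query = new SolrQuery();
query.setRequestHandler("/terms");  // route to the TermsComponent handler
query.setTerms(true);
query.setTermsLimit(-1);            // -1 = return all terms
query.addTermsField("XXX");
QueryResponse response = client.query(query);
// the size of the returned term list is the number of unique values
int uniqueValues = response.getTermsResponse().getTerms("XXX").size();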

Finding number of unique terms over multiple fields

I need to find the number (or list) of unique terms over a combination of two or more fields in Lucene (Java). I am using the Java libraries for Lucene 4.1.0. I checked questions such as this and this, but they discuss finding the list of unique terms from a single (specific) field, or over all fields (no subset).
For example, I am interested in number(unique(height, gender)) rather than number(unique(height)), or number(unique(gender)).
Given the data:
height,gender
1,M
2,F
3,M
3,F
4,M
4,F
number(unique(height)) is 4, number(unique(gender)) is 2, and number(unique(gender,height)) is 6.
Any help will be greatly appreciated.
Thanks!
If you have predefined the fields, then the simplest and quickest option (in search terms) would be to index a combined field, i.e. heightGender (e.g. 1.23:male). You can then just count the unique terms in this field, as sketched below; however, this doesn't offer any flexibility at search time.
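A minimal sketch of the combined-field approach against the Lucene 4.1 API (field and variable names are illustrative):

import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;

// At index time, add one combined key per document:
doc.add(new StringField("heightGender", height + ":" + gender, Field.Store.NO));

// At search time, count the unique terms in that field:
DirectoryReader reader = DirectoryReader.open(directory);
Terms terms = MultiFields.getTerms(reader, "heightGender");
long unique = 0;
if (terms != null) {
    TermsEnum te = terms.iterator(null);  // Lucene 4.1 iterator takes a reuse argument
    while (te.next() != null) {
        unique++;
    }
}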
A more flexible approach would be to use facets (https://lucene.apache.org/core/4_1_0/facet/index.html). You would then constrain your query to each value of one field (e.g. gender: male/female) and retrieve all the values (and document counts) of the other field.
However, if you do not have the ability to change the indexing process, you are left with a brute-force search, using Boolean queries to find the number of documents in the index for every combination of the field values you are interested in (sketched below). I presume you are only counting combinations where the number of documents is non-zero.
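A hedged sketch of that brute-force counting (Lucene 4.1 API; the candidate value arrays are illustrative and would come from your data):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TotalHitCountCollector;

// Count (height, gender) combinations that actually occur in the index.
int combos = 0;
for (String h : new String[] {"1", "2", "3", "4"}) {
    for (String g : new String[] {"M", "F"}) {
        BooleanQuery q = new BooleanQuery();
        q.add(new TermQuery(new Term("height", h)), BooleanClause.Occur.MUST);
        q.add(new TermQuery(new Term("gender", g)), BooleanClause.Occur.MUST);
        TotalHitCountCollector collector = new TotalHitCountCollector();
        searcher.search(q, collector);  // counts hits without collecting documents
        if (collector.getTotalHits() > 0) {
            combos++;
        }
    }
}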
It is worth noting that this question is exactly what Solr Pivot Facets address (http://lucidworks.com/blog/pivot-facets-inside-and-out/)

Modifying .tim and .tip files in Lucene Index

I have a Lucene application with multiple indices, in which relevancy scoring suffers due to differences in term frequencies across the indices. My understanding is that the term dictionary (.tim file) contains "term statistics", such as the document frequency of each term. I was thinking that one approach might be to modify the .tim file for each index (and related segments) to update the term statistics. Is it possible to overwrite or modify the .tim and .tip files in such a way?
relevancy scoring suffers
From the FAQ:
score values are meaningful only for purposes of comparison between other documents for the exact same query and the exact same index. When you try to compute a percentage, you are setting up an implicit comparison with scores from other queries.
Is it possible? I suppose, but it strikes me as about as good an idea as attempting to change an application by directly modifying the compiled binaries.
If you need very specific behavior from scoring, then you should generally implement a Similarity that does what you need; extending TFIDFSimilarity is often a good idea. It's really not clear what the exact problem is, so I can't provide more specific guidance than that, but perhaps it points in the right general direction.
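As a hedged illustration (not a recommendation for this particular problem), here is a Similarity skeleton assuming Lucene 4.x+, where DefaultSimilarity extends TFIDFSimilarity; overriding idf() changes how document frequency feeds into the score:

import org.apache.lucene.search.similarities.DefaultSimilarity;

public class CustomIdfSimilarity extends DefaultSimilarity {
    @Override
    public float idf(long docFreq, long numDocs) {
        // Example only: flatten IDF so per-index docFreq differences stop
        // influencing scores. A real implementation would substitute its
        // own statistics here.
        return 1.0f;
    }
}

// Installed at search time with:
// indexSearcher.setSimilarity(new CustomIdfSimilarity());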

How to compute norm at search time in Lucene 3.6?

I need to run a query over a field twice: once taking the norm into account, and once without the norm affecting the score.
What I have done is index the field twice, under two different names, as follows:
"field" with Field.omitNorms(false);
"field_noNorm" with Field.omitNorms(true);
This solution achieves my objective, but it has doubled the size of the index, and now that the index size is becoming critical I need a smarter solution, even at the cost of query time when searching the field without norms.
Is it possible to store a single field with norms and multiply by the inverse of the norm for each document at query time to remove its effect on the final score?
And if so, what is the fastest way to retrieve the norm of a document at search time?
I solved the problem by reading the index-time norms at search time with:
searcher.getIndexReader().norms(field);
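For completeness, a sketch of how those bytes can be used, assuming Lucene 3.6 and its default norm encoding (one lossy byte per document; docId and scoreWithNorm are illustrative):

import org.apache.lucene.search.Similarity;

byte[] norms = searcher.getIndexReader().norms("field");
// decode the single stored byte back into a float norm
float norm = Similarity.decodeNorm(norms[docId]);
// dividing the norm out approximates the score the un-normed field would give
float scoreWithoutNorm = scoreWithNorm / norm;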

In Lucene, can I search one index but use IDF from another one?

I'm building a system where I want to show only results indexed in the past few days.
Furthermore, I don't want to maintain a giant index with a million documents if I only want to return results from the last couple of days (thousands of documents).
On the other hand, my system relies heavily on the occurrences of terms in the indexed documents having a realistic distribution (and consequently a realistic IDF).
That said, I would like to use a small index to return results, but compute document scores using the IDF from a much larger index (or even an external source).
The Similarity API doesn't seem to allow me to do this: the idf method does not receive the term being used as a parameter.
Another possibility is to use TrieRangeQuery to make sure the documents shown are within the last couple of days. Again, I would rather not maintain a larger index, and this kind of query is not cheap either.
You should be able to extend IndexReader and override the docFreq() methods to provide whatever values you'd like. One way to implement this is to open two IndexReader instances, one for the small index and one for the large index. All the methods are delegated to the small IndexReader, except for docFreq(), which is delegated to the large index. You'll need to scale the value returned, i.e.
int myNewDocFreq = (int) ((double) bigIndexReader.docFreq(t) / bigIndexReader.maxDoc() * smallIndexReader.maxDoc());
(The cast to double matters: with plain int arithmetic, docFreq / maxDoc would truncate to zero.)
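A hedged sketch of that delegation, assuming Lucene 3.x and its FilterIndexReader (which forwards every call to a wrapped reader); BorrowedIdfReader is an illustrative name:

import java.io.IOException;
import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class BorrowedIdfReader extends FilterIndexReader {
    private final IndexReader big;

    public BorrowedIdfReader(IndexReader small, IndexReader big) {
        super(small);  // the protected field 'in' now points at the small index
        this.big = big;
    }

    @Override
    public int docFreq(Term t) throws IOException {
        // take document frequency from the large index, scaled down to the
        // small index's size so IDF stays in a realistic range
        return (int) ((double) big.docFreq(t) / big.maxDoc() * in.maxDoc());
    }
}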
