Hibernate Search and Elasticsearch analyzers for exact search - Java

I have StandardAnalyzer on a field in my entity:
@Field(name = "myField", index = Index.YES, analyze = Analyze.YES, analyzer = @Analyzer(impl = StandardAnalyzer.class))
But I wonder how to then make it possible to:
1) Search by words, e.g. I like bananas -> ["I", "like", "bananas"] (which my analyzer currently allows me to do)
2) Search by exact input: "I like bananas" -> "I like bananas" (which StandardAnalyzer does not allow; the fitting change would be impl = KeywordAnalyzer.class)
Should I change my analyzer, or should I instead change the way of searching in Java code based on the input (e.g. when it starts and ends with double quotes)?
Regards

Maybe you want a phrase query? A phrase query looks for a sequence of tokens, instead of just one token.
Query luceneQuery = queryBuilder.phrase()
        .onField("myField")
        .sentence("I like bananas")
        .createQuery();
See https://docs.jboss.org/hibernate/search/5.11/reference/en-US/html_single/#_phrase_queries
Phrase queries are still sort of "fuzzy", however, since the analyzer is still involved: case will be ignored, etc. It's just about looking for a sequence of (analyzed) tokens.
If you really need exact search, including case sensitivity, you can simply declare two fields:
#Field(name = "myField", index = Index.YES, analyze = Analyze.YES, analyzer = #Analyzer(impl = StandardAnalyzer.class))
#Field(name = "myField_exact", index = Index.YES, analyze = Analyze.NO)
Then you can target either myField or myField_exact at query time, depending on your needs.
Of course, you will need to reindex your data before myField_exact becomes available.
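For example, targeting the exact field could look like this (a minimal sketch using the same queryBuilder as in the phrase example above; since myField_exact is not analyzed, ignoreAnalyzer() keeps the DSL from analyzing your input):
Query exactQuery = queryBuilder.keyword()
        .onField("myField_exact")
        .ignoreAnalyzer() // the field is not analyzed, so don't analyze the input either
        .matching("I like bananas")
        .createQuery();
Combined with the quote-detection idea from the question, your code could route quoted input to myField_exact and everything else to myField.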

Why is my Lucene only matching field if I add wildcard to end of value

Why is my Lucene 4.10 only matching a field if I add a wildcard to the end of the value?
I have a field called acoustid defined with KeywordAnalyzer
ACOUSTID("acoustid",IndexFieldTypes.TEXT_NOT_STORED_ANALYZED_NO_NORMS, new KeywordAnalyzer()),
If I run my query like this I get no matches
query=acoustid:ae8f4538-9971-41b3-a6d0-bbca1c13e855
but if I add a wildcard I get correct matches
query=acoustid:ae8f4538-9971-41b3-a6d0-bbca1c13e855*
Note that the query is escaped for Lucene before it gets parsed.
I have another field (reid) that also stores guids using KeywordAnalyzer
and that works fine.
query=reid:425cf29a-1490-43ab-abfa-7b17a2cec351
I cannot understand this, because I don't see how there can be any additional data after the value, and my unit tests such as
@Test
public void testFindReleaseByAcoustId() throws Exception {
    Results res = ss.search("acoustid:1d9e8ed6-3893-4d3b-aa7d-6cd79609e389", 0, 10);
    assertEquals(1, res.getTotalHits());
    assertEquals("1d9e8ed6-3893-4d3b-aa7d-6cd79609e386", getReleaseId(res.results.get(0).getDoc()));
}
it works fine.
What should my next step be?
Update
Just remembered that I added an option to explain the query, so this is with the wildcard:
Query:+acoustid:ae8f4538-9971-41b3-a6d0-bbca1c13e855* +src:1
0:Score:100.0
ba938fab-22b1-42ba-9bda-47261bc0569d:Now That's What I Call the 90s
2.954172 = (MATCH) sum of:
  0.3385043 = (MATCH) ConstantScore(acoustid:ae8f4538-9971-41b3-a6d0-bbca1c13e855), product of:
    1.0 = boost
    0.3385043 = queryNorm
  2.6156676 = (MATCH) weight(src:1 in 9) [DefaultSimilarity], result of:
    2.6156676 = score(doc=9,freq=1.0 = termFreq=1.0), product of:
      0.9409648 = queryWeight, product of:
        2.779772 = idf(docFreq=2052700, maxDocs=12169449)
        0.3385043 = queryNorm
      2.779772 = fieldWeight in 9, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        2.779772 = idf(docFreq=2052700, maxDocs=12169449)
        1.0 = fieldNorm(doc=9)
and this is without the wildcard:
Query:+(acoustid:ae8f4538 acoustid:9971 acoustid:41b3 acoustid:a6d0 acoustid:bbca1c13e855) +src:1
so clearly the hyphens ('-') are causing an issue by breaking the value down into separate terms.
My working query on the similar reid gives
Query:+reid:c3c0e462-1606-40dc-9667-1b26b9fb44c5 +src:1
0:Score:100.0
c3c0e462-1606-40dc-9667-1b26b9fb44c5:Liquid Tension Experiment
16.852135 = (MATCH) sum of:
  16.39361 = (MATCH) weight(reid:c3c0e462-1606-40dc-9667-1b26b9fb44c5 in 552496) [DefaultSimilarity], result of:
    16.39361 = score(doc=552496,freq=1.0 = termFreq=1.0), product of:
      0.9863018 = queryWeight, product of:
        16.621292 = idf(docFreq=1, maxDocs=12169449)
        0.059339657 = queryNorm
      16.621292 = fieldWeight in 552496, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        16.621292 = idf(docFreq=1, maxDocs=12169449)
        1.0 = fieldNorm(doc=552496)
  0.4585254 = (MATCH) weight(src:1 in 552496) [DefaultSimilarity], result of:
    0.4585254 = score(doc=552496,freq=1.0 = termFreq=1.0), product of:
      0.16495071 = queryWeight, product of:
        2.779772 = idf(docFreq=2052700, maxDocs=12169449)
        0.059339657 = queryNorm
      2.779772 = fieldWeight in 552496, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        2.779772 = idf(docFreq=2052700, maxDocs=12169449)
        1.0 = fieldNorm(doc=552496)
Ah, I may have found the issue, but will have to rebuild the index to check
reid is defined to use IndexFieldTypes.TEXT_STORED_NOT_ANALYZED_NO_NORMS
acoustid is defined to use IndexFieldTypes.TEXT_NOT_STORED_ANALYZED_NO_NORMS
Try the following:
WildcardQuery q = new WildcardQuery(new Term("acoustid", "ae8f4538-9971-41b3-a6d0-bbca1c13e855*"));
q.setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_REWRITE);
Query rewritten = searcher.rewrite(q);
and look into the rewritten query (via toString() or a debugger).
rewritten will be a boolean query made from single-term query clauses reflecting the real index terms.
UPD: in Lucene 4 the middle line should be
q.setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);
Can't give a super concrete answer here, because I don't know what ss is. I'm assuming it's a layer written in your application to simplify running a Lucene search and managing readers, that sort of thing.
I'm assuming ss.search looks something like: get an IndexReader, open a QueryParser and parse the query string, run the query, return Results that your application knows how to read.
The problem step here is the query parser. QueryParser gets passed an analyzer, and if that analyzer doesn't match the field you're searching against, you run into problems. If you analyze a GUID with StandardAnalyzer, you'll end up with a query, post analysis, that looks something like:
acoustid:"ae8f4538 9971 41b3 a6d0 bbca1c13e855"
Which doesn't match the way it appears in the index. The wildcard query works because wildcard queries (and fuzzy queries, etc.) skip analysis.
As far as why reid works, I'm not sure; I'd have to see what ss.search looks like. However, if I were to make a bet on it, I'd bet you'll find a PerFieldAnalyzerWrapper in which reid has a KeywordAnalyzer set up for it but acoustid doesn't. In that case, add acoustid to the fieldAnalyzers map with a KeywordAnalyzer, and you're good to go.
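A hypothetical sketch of that wiring (Lucene 4.10 API; the StandardAnalyzer default and the exact field set are assumptions about your setup):
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

Map<String, Analyzer> fieldAnalyzers = new HashMap<String, Analyzer>();
fieldAnalyzers.put("reid", new KeywordAnalyzer());
fieldAnalyzers.put("acoustid", new KeywordAnalyzer()); // the entry that is presumably missing
Analyzer analyzer = new PerFieldAnalyzerWrapper(
        new StandardAnalyzer(Version.LUCENE_4_10_0), fieldAnalyzers);
// Use the same wrapper for the QueryParser as for indexing:
QueryParser parser = new QueryParser(Version.LUCENE_4_10_0, "acoustid", analyzer);
Query q = parser.parse("acoustid:ae8f4538-9971-41b3-a6d0-bbca1c13e855");
With a KeywordAnalyzer on acoustid, the GUID survives parsing as a single term instead of being split at the hyphens.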
Aided by the two previous answers, the problem was that the query analyzer was different from the analyzer used when indexing. But it was not a coding error, it was a deployment error.
When I last deployed the index, two new fields were being indexed (not the ones above), and hence the indexing code and the classes that define the analyzers used for the different fields had changed. But at the time I didn't deploy updated searcher code, because the searcher code itself had not changed; only the indexing library that the searcher code uses had changed.
I did actually try to deploy the latest search code, but I also had another issue regarding JAXB and Java 8/Java 10 that prevented deployment. Since I didn't think I needed to redeploy anyway, I left it.
And since the problem was with an old field (acoustid), not a new field, I didn't realize the issue was a new one.
Anyway, I solved the JAXB issue, redeployed with the latest code base, and now search is working as expected.

ElasticSearch - define custom letter order for sorting

I'm using ElasticSearch 2.4.2 (via HibernateSearch 5.7.1.Final from Java).
I have a problem with string sorting.
The language of my application has diacritics, which have a specific alphabetic
ordering. For example Ł goes directly after L, Ó goes after O, etc.
So you are supposed to sort the strings like this:
Dla
Dła
Doa
Dóa
Dza
Eza
ElasticSearch sorts by the typical letters first, and moves all the special letters to the end:
Dla
Doa
Dza
Dła
Dóa
Eza
Can I add a custom letter ordering for ElasticSearch?
Maybe there are some plugins for this?
Do I need to write my own plugin? How do I start?
I found a plugin for the Polish language for ElasticSearch, but as I understand it, it is for analysis, and analysis is not a solution in my case, because it will ignore the diacritics and leave words with L and Ł mixed:
Dla
Dłb
Dlc
This would sometimes be acceptable, but not in my specific use case.
I will be grateful for any remarks on this.
I've never used it, but there is a plugin that could fit your needs: the ICU collation plugin.
You will have to use the icu_collation token filter, which turns the tokens into collation keys. For that reason you will need to use a separate @Field (e.g. myField_sort) in Hibernate Search.
You can assign a specific analyzer to your field with @Field(name = "myField_sort", analyzer = @Analyzer(definition = "myCollationAnalyzer")), and define this analyzer (type, parameters) with something like this on one of your entities:
@Entity
@Indexed
@AnalyzerDef(
        name = "myCollationAnalyzer",
        filters = {
                @TokenFilterDef(
                        name = "polish_collation",
                        factory = ElasticsearchTokenFilterFactory.class,
                        params = {
                                @Parameter(name = "type", value = "'icu_collation'"),
                                @Parameter(name = "language", value = "'pl'")
                        }
                )
        }
)
public class MyEntity {
See the documentation for more information: https://docs.jboss.org/hibernate/stable/search/reference/en-US/html_single/#_custom_analyzers
It's admittedly a bit clumsy right now, but analyzer configuration will get a bit cleaner in the next Hibernate Search version with normalizers and analyzer definition providers.
Note: as usual, your field will need to be declared as sortable (@SortableField(forField = "myField_sort")).
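With that in place, sorting at query time might look like this (a rough sketch against the Hibernate Search 5.x sort DSL; the entity name and the fullTextSession variable are placeholders for your own setup):
QueryBuilder qb = fullTextSession.getSearchFactory()
        .buildQueryBuilder().forEntity(MyEntity.class).get();
Sort sort = qb.sort()
        .byField("myField_sort") // the collation-key field defined above
        .createSort();
FullTextQuery query = fullTextSession.createFullTextQuery(qb.all().createQuery(), MyEntity.class);
query.setSort(sort);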

Get N terms with top TF-IDF scores for each document in Lucene (PyLucene)

I am currently using PyLucene, but since there is no documentation for it, I guess a solution in Java for Lucene will also do (though if anyone has one in Python, it would be even better).
I am working with scientific publications, and for now I retrieve their keywords. However, for some documents there are simply no keywords. An alternative would be to take the N words (5-8) with the highest TF-IDF scores.
I am not sure how to do it, and also when. By when, I mean: do I have to tell Lucene at the indexing stage to compute these values, or is it possible to do it when searching the index?
What I would like to have for each query would be something like this :
Query Ranking
Document1, top 5 TFIDF terms, Lucene score (default TFIDF)
Document2, " " , " "
...
What would also be possible is to first retrieve the ranking for the query, and then compute the top 5 TFIDF terms for each of these documents.
Does anyone have an idea how I should do this?
If a field is indexed, document frequencies can be retrieved with getTerms. If a field has stored term vectors, term frequencies can be retrieved with getTermVector.
I also suggest looking at MoreLikeThis, which uses tf*idf to create a query similar to the document, from which you can extract the terms.
And if you'd like a more pythonic interface, that was my motivation for lupyne:
from lupyne import engine
searcher = engine.IndexSearcher(<filepath>)
df = dict(searcher.terms(<field>, counts=True))
tf = dict(searcher.termvector(<docnum>, <field>, counts=True))
query = searcher.morelikethis(<docnum>, <field>)
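For the Java route, the two lookups mentioned above might look roughly like this (a sketch against the Lucene 4.x API; the index path, field name, and document number are placeholders):
import java.io.File;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

IndexReader reader = DirectoryReader.open(FSDirectory.open(new File("/path/to/index")));
// Document frequency of a single term across the whole index:
int df = reader.docFreq(new Term("contents", "lucene"));
// Term frequencies inside one document, via its stored term vector:
Terms vector = reader.getTermVector(0, "contents");
TermsEnum termsEnum = vector.iterator(null);
BytesRef term;
while ((term = termsEnum.next()) != null) {
    System.out.println(term.utf8ToString() + " -> " + termsEnum.totalTermFreq());
}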
After digging a bit in the mailing list, I ended up with what I was looking for.
Here is the method I came up with:
import math
import operator

from org.apache.lucene.index import Term, TermsEnum
from org.apache.lucene.util import BytesRefIterator

def getTopTFIDFTerms(docID, reader):
    termVector = reader.getTermVector(docID, "contents")
    termsEnumvar = termVector.iterator(None)
    termsref = BytesRefIterator.cast_(termsEnumvar)
    tc_dict = {}     # Counts of each term
    dc_dict = {}     # Number of docs associated with each term
    tfidf_dict = {}  # TF-IDF values of each term in the doc
    N_terms = 0
    try:
        while termsref.next():
            termval = TermsEnum.cast_(termsref)
            fg = termval.term().utf8ToString()  # Term in unicode
            tc = termval.totalTermFreq()        # Term count in the doc
            # Number of docs having this term in the index
            dc = reader.docFreq(Term("contents", termval.term()))
            N_terms = N_terms + 1
            tc_dict[fg] = tc
            dc_dict[fg] = dc
    except:
        print 'error in term_dict'
    # Compute TF-IDF for each term
    # (N_DOCS_INDEX is the total number of docs in the index, defined elsewhere)
    for term in tc_dict:
        tf = tc_dict[term] / float(N_terms)  # float() avoids integer division
        idf = 1 + math.log(float(N_DOCS_INDEX) / (dc_dict[term] + 1))
        tfidf_dict[term] = tf * idf
    # Get a representation of the dictionary sorted by TF-IDF, descending
    sorted_x = sorted(tfidf_dict.items(), key=operator.itemgetter(1), reverse=True)
    # Get the top 5
    top5 = [i[0] for i in sorted_x[:5]]  # replace 5 by TOP N
    return top5
I am not sure why I have to cast the termsEnum as a BytesRefIterator; I got this from a thread in the mailing list, which can be found here.
Hope this will help :)

Find out which field matched term in custom score script

I am using a custom score query with a multiMatchQuery. Ultimately what I want is simple and requires little explanation. In my Java custom score script, I want to be able to find out which field a result matched on.
Example:
If I search Starbucks and a result comes back with the name Starbucks then I want to be able to know that name.basic was the field that matched my query. If I search for coffee and starbucks comes back I want to be able to know that tags was the field that matched.
Is there anyway to do this?
Search Query Code:
def basicSearchableSearch(t: String, lat: Double, lon: Double, r: Double, z: Int, bb: BoundingBox, max: Int): SearchResponse = {
  val multiQuery = filteredQuery(
    multiMatchQuery(t)
      // Matches businesses and POIs
      .field("name.basic").operator(Operator.OR)
      .field("name.no_space")
      // Businesses only
      .field("tags").boost(6f),
    geoBoundingBoxFilter("location")
      .bottomRight(bb.botRight.y, bb.botRight.x)
      .topLeft(bb.topLeft.y, bb.topLeft.x)
  )
  val customQuery = customScoreQuery(multiQuery)
    .script("customJavaScript")
    .lang("native")
    .param("lat", lat)
    .param("lon", lon)
    .param("zoom", z)
  global.Global.getClient().prepareSearch("searchable")
    .setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
    .setQuery(customQuery)
    .setFrom(0).setSize(max)
    .execute()
    .actionGet();
}
It's only simple for simple queries. On complex queries, the question of which field matched is actually quite nontrivial. So I cannot think of any efficient way to do it.
Perhaps, you could consider moving your custom score calculation closer to the match. The multi_match query is basically a shortcut for a set of match queries on the same query string combined by a dis_max query. So, you are currently building something like this:
custom_score(
    filtered(
        dis_max(match_1, match_2, match_3)
    )
)
What you can do is to move your custom_score under dis_max and build something like this:
filtered(
    dis_max(
        custom_score_1(match_1),
        custom_score_2(match_2),
        custom_score_3(match_3)
    )
)
Obviously, this will be a somewhat different query, since dis_max will operate on custom score instead of original score.
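For illustration, the restructured query might look roughly like this with the (old) Elasticsearch Java API the question is using; the per-field "field" param is a made-up way of telling the script which field it is scoring:
DisMaxQueryBuilder disMax = QueryBuilders.disMaxQuery()
        .add(QueryBuilders.customScoreQuery(QueryBuilders.matchQuery("name.basic", t))
                .script("customJavaScript").lang("native").param("field", "name.basic"))
        .add(QueryBuilders.customScoreQuery(QueryBuilders.matchQuery("name.no_space", t))
                .script("customJavaScript").lang("native").param("field", "name.no_space"))
        .add(QueryBuilders.customScoreQuery(QueryBuilders.matchQuery("tags", t).boost(6f))
                .script("customJavaScript").lang("native").param("field", "tags"));

QueryBuilder query = QueryBuilders.filteredQuery(
        disMax,
        FilterBuilders.geoBoundingBoxFilter("location")
                .bottomRight(bb.botRight.y, bb.botRight.x)
                .topLeft(bb.topLeft.y, bb.topLeft.x));
Since each custom_score wraps exactly one match query, each script invocation knows which field produced the match.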

Lucene wildcard matching fails on chemical notations(?)

Using Hibernate Search annotations (mostly just @Field(index = Index.TOKENIZED)) I've indexed a number of fields related to a persisted class of mine called Compound. I've set up text search over all the indexed fields using the MultiFieldQueryParser, which has so far worked fine.
Among the fields indexed and searchable is a field called compoundName, with sample values:
3-Hydroxyflavone
6,4'-Dihydroxyflavone
When I search for either of these values in full the related Compound instances are returned. However problems occur when I use the partial name and introduce wildcards:
searching for 3-Hydroxyflav* still gives the correct hit, but
searching for 6,4'-Dihydroxyflav* fails to find anything.
Now, as I'm quite new to Lucene / Hibernate Search, I'm not quite sure where to look at this point. I think it might have something to do with the ' present in the second query, but I don't know how to proceed. Should I look into tokenizers / analyzers / query parsers or something else entirely?
Or can anyone tell me how I can get the second wildcard search to match, preferably without breaking the MultiField-search behavior?
I'm using Hibernate-Search 3.1.0.GA & Lucene-core 2.9.3.
Some relevant code bits to illustrate my current approach:
Relevant parts of the indexed Compound class:
@Entity
@Indexed
@Data
@EqualsAndHashCode(callSuper = false, of = { "inchikey" })
public class Compound extends DomainObject {

    @NaturalId
    @NotEmpty
    @Length(max = 30)
    @Field(index = Index.TOKENIZED)
    private String inchikey;

    @ManyToOne
    @IndexedEmbedded
    private ChemicalClass chemicalClass;

    @Field(index = Index.TOKENIZED)
    private String commonName;

    ...
}
How I currently search over the indexed fields:
String[] searchfields = Compound.getSearchfields();
MultiFieldQueryParser parser =
        new MultiFieldQueryParser(Version.LUCENE_29, searchfields, new StandardAnalyzer(Version.LUCENE_29));
FullTextSession fullTextSession = Search.getFullTextSession(getSession());
FullTextQuery fullTextQuery =
        fullTextSession.createFullTextQuery(parser.parse("searchterms"), Compound.class);
List<Compound> hits = fullTextQuery.list();
Use WhitespaceAnalyzer instead of StandardAnalyzer. It will just split at whitespace, and not at commas, hyphens etc. (It will not lowercase them though, so you will need to build your own chain of whitespace + lowercase, assuming you want your search to be case-insensitive). If you need to do things differently for different fields, you can use a PerFieldAnalyzer.
You can't just set it to un-tokenized, because that will interpret your entire body of text as one token.
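A minimal sketch of that whitespace + lowercase chain (Lucene 2.9-era API, matching the versions in the question; the class name is made up):
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class WhitespaceLowerCaseAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Split on whitespace only, then lowercase each token.
        return new LowerCaseFilter(new WhitespaceTokenizer(reader));
    }
}
You would then pass the same analyzer both to the MultiFieldQueryParser and to the indexing side (e.g. via @Analyzer(impl = WhitespaceLowerCaseAnalyzer.class)).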
I think your problem is a combination of analyzer and query-language problems. It is hard to say what exactly causes it. To find out, I recommend you inspect your index using the Lucene index tool Luke.
Since in your Hibernate Search configuration you are not using a custom analyzer, the default - StandardAnalyzer - is used. This is consistent with the fact that you use StandardAnalyzer in the constructor of MultiFieldQueryParser (always use the same analyzer for indexing and searching!). What I am not so sure of is how "6,4'-Dihydroxyflavone" gets tokenized by StandardAnalyzer. That's the first thing you have to find out. For example, the javadoc says:
Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
It might be that you need to write your own analyzer which tokenizes your chemical names the way you need it for your use cases.
Next, the query parser. Make sure you understand the query syntax - Lucene query syntax. Some characters have special meaning, for example '-'. It could be that your query is parsed the wrong way.
Either way, the first step is to find out how your chemical names get tokenized. Hope that helps.
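If you want to check the tokenization quickly without Luke, you can run the analyzer by hand (a small sketch against the Lucene 2.9 API, using the field and sample value from the question):
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_29);
TokenStream stream = analyzer.tokenStream("compoundName", new StringReader("6,4'-Dihydroxyflavone"));
TermAttribute term = stream.addAttribute(TermAttribute.class);
while (stream.incrementToken()) {
    System.out.println(term.term()); // prints one line per token
}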
I wrote my own analyzer:
import java.util.Set;
import java.util.regex.Pattern;
import org.apache.lucene.index.memory.PatternAnalyzer;
import org.apache.lucene.util.Version;

public class ChemicalNameAnalyzer extends PatternAnalyzer {

    private static Version version = Version.LUCENE_29;
    private static Pattern pattern = compilePattern();
    private static boolean toLowerCase = true;
    private static Set stopWords = null;

    public ChemicalNameAnalyzer() {
        super(version, pattern, toLowerCase, stopWords);
    }

    public static Pattern compilePattern() {
        StringBuilder sb = new StringBuilder();
        sb.append("(-{0,1}\\(-{0,1})"); // Matches an optional dash followed by an opening round bracket followed by an optional dash
        sb.append("|"); // "OR" (regex alternation)
        sb.append("(-{0,1}\\)-{0,1})");
        sb.append("|"); // "OR" (regex alternation)
        sb.append("((?<=([a-zA-Z]{2,}))-(?=([^a-zA-Z])))"); // Matches a dash ("-") preceded by two or more letters and succeeded by a non-letter
        return Pattern.compile(sb.toString());
    }
}
