Why is Lucene sometimes not matching InChIKeys?

I have indexed my database using Hibernate Search. I use a custom analyzer, both for indexing and for querying. I have a field called inchikey that should not get tokenized. Example values are:
BBBAWACESCACAP-UHFFFAOYSA-N
KEZLDSPIRVZOKZ-AUWJEWJLSA-N
When I look into my index with Luke I can confirm that they are not tokenized, as required.
However, when I try to search for them using the web app, some inchikeys are found and others are not. Curiously, for the inchikeys that are not found, the search DOES work when I search without the last hyphen, like so: BBBAWACESCACAP-UHFFFAOYSA N
I have not been able to find a common element in the inchikeys that are not found.
Any idea what is going on here?
I use a MultiFieldQueryParser to search over the different fields in the database:
String[] searchfields = Compound.getSearchfields();
// Query with the same analyzer that is used for indexing
MultiFieldQueryParser parser = new MultiFieldQueryParser(Version.LUCENE_29, searchfields, new ChemicalNameAnalyzer());
// Disable the following if search performance is too slow
parser.setAllowLeadingWildcard(true);
FullTextQuery fullTextQuery = fullTextSession.createFullTextQuery(parser.parse("searchterms"), Compound.class);
List<Compound> hits = fullTextQuery.list();
More details about our setup have been posted here by Tim and me.

It turns out the last entries in the input file are not being indexed correctly: these ARE being tokenized. In fact, it seems they are indexed twice, once without being tokenized and once with. When I search, I cannot find the untokenized ones.
I have not yet found the reason, but I think it perhaps has to do with our parser ending while Lucene is still indexing the last entries, and as a result Lucene reverting to the default analyzer (StandardAnalyzer). When I find the culprit I will report back here.
Adding @Analyzer(impl = ChemicalNameAnalyzer.class) to the fields solves the problem, but what I want is my original setup, with the default analyzer defined once, in config, like so:
<property name="hibernate.search.analyzer">path.to.ChemicalNameAnalyzer</property>
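The mismatch described above (some entries indexed by a tokenizing analyzer, but queried as a single untokenized term) can be illustrated with a small plain-Java sketch. The two methods below are simplified stand-ins, not Lucene's StandardAnalyzer or the ChemicalNameAnalyzer, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Simplified stand-ins for two analyzers; these are NOT Lucene's implementations.
class AnalyzerMismatchSketch {

    // Keyword-style analysis: the whole value is kept as one term.
    static List<String> keywordTerms(String value) {
        return Collections.singletonList(value);
    }

    // Tokenizing analysis (assumption): split on non-alphanumerics and lowercase.
    static List<String> tokenizedTerms(String value) {
        List<String> terms = new ArrayList<>();
        for (String part : value.split("[^A-Za-z0-9]+")) {
            if (!part.isEmpty()) {
                terms.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // A query matches only if every query term exists among the indexed terms.
    static boolean matches(List<String> indexedTerms, List<String> queryTerms) {
        return indexedTerms.containsAll(queryTerms);
    }

    public static void main(String[] args) {
        String key = "BBBAWACESCACAP-UHFFFAOYSA-N";
        List<String> query = keywordTerms(key);

        // Indexed and queried the same way: the single term is found.
        System.out.println(matches(keywordTerms(key), query));

        // Indexed by the tokenizing stand-in: the index holds three lowercase
        // fragments, so the single untokenized query term is not found.
        System.out.println(tokenizedTerms(key));
        System.out.println(matches(tokenizedTerms(key), query));
    }
}
```

If the index really contains a mix of both forms, inspecting the affected documents in Luke should show the fragment terms next to the whole key.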

Related

How to ignore Stop words search using Lucene, when analysis of Stop Words is required?

How to ignore Stop words during Lucene Search?
I have analyzed all data, including stop words, using a custom analyzer, because most of our searches require them.
However, one module now has an additional requirement: exclude stop words from searches on the same fields, where the stop words have already been analyzed.
The fields are currently analyzed like this:
@Fields({@Field(index = Index.YES, store = Store.NO, analyzer = @Analyzer(impl=CustomStopWordsAccepterAnalyzer.class)),
The requirement is to ignore stop words when the search string is, for example, "Love With Hubby", and to return the best-scoring results for Love Hubby. Kindly suggest!

Once you enable stop-word removal for a field, the stop words are not encoded in the index, so they cannot be made to reappear at query time.
The problem you have is quite common, as often people need to combine the score of multiple full-text queries performed with different options.
The solution is rather simple: for each property of your Java entity, use multiple @Field annotations and assign a different index field name to each. This way you can target each field with a BooleanQuery and have the resulting scores take both fields into account.

Using Apache Solr's boost query function with Spring in Java

I'm writing a Java application that is using Apache Solr to index and search through a list of articles. A requirement I am dealing with is that when a user searches for something, we are supplying a list of recommended related search terms, and the user has the option to include those extra terms in their search. The problem I'm having, however, is that we want the user's original search term to be prioritized, and results that match that should appear before results that only match related terms.
My research suggests that Solr's boost function is the solution for this, but I'm having some trouble getting it to work with Spring. The code all runs fine and I get my search results as expected, but the boost function doesn't seem to actually be re-ordering my searches at all. For example, I'm trying to do something like this:
Query query = new SimpleQuery();
Criteria searchCriteria = Criteria.where("title").contains("A").boost((float) 2);
Criteria extraCriteria = Criteria.where("title").contains("B").boost((float) 1);
query.addCriteria(searchCriteria.or(extraCriteria));
In this example I would be searching for any document whose title contains "A" or "B", but I want to boost results that match "A" to the top of the list.
I've also tried using the Extended DisMax Query Parser with a different syntax to achieve the same result, with similar lack of success. To follow the same example pattern, I'm trying to use the expression criteria as follows:
Query query = new SimpleQuery();
Criteria searchCriteria = Criteria.where("title").expression("A^2.0 OR B^1.0");
query.setDefType("edismax");
query.addCriteria(searchCriteria);
Again I would expect this to return documents with titles matching "A" or "B" but boost results matching "A", and again it simply doesn't seem to actually affect the ordering of my results at all.

Okay, I figured out the problem here. Elsewhere in the code someone else had added this snippet:
query.setPageRequest(pageable);
This was done to support pagination of the search results, but the pageable object ALSO contained some sort orders, which appear to have been added to the query by the .setPageRequest method. Something to look out for in the future: it looks like sorts override boosting when working with Spring Solr queries in this scenario.
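Why an explicit sort hides the effect of a boost can be shown without any Solr or Spring classes. The sketch below is plain Java: the Hit class, its year field and the sample scores are all hypothetical and only stand in for documents that come back with a relevance score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java illustration only; no Solr or Spring classes are involved.
class SortVersusBoostSketch {

    static final class Hit {
        final String title;
        final double score; // relevance score, already including any boost
        final int year;     // hypothetical field used by an explicit sort

        Hit(String title, double score, int year) {
            this.title = title;
            this.score = score;
            this.year = year;
        }
    }

    // No explicit sort: highest score first, so boosted matches lead.
    static List<String> orderByScore(List<Hit> hits) {
        List<Hit> copy = new ArrayList<>(hits);
        copy.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return titles(copy);
    }

    // Explicit sort on another field: the score no longer decides the order.
    static List<String> orderByYearDescending(List<Hit> hits) {
        List<Hit> copy = new ArrayList<>(hits);
        copy.sort(Comparator.comparingInt((Hit h) -> h.year).reversed());
        return titles(copy);
    }

    static List<String> titles(List<Hit> hits) {
        List<String> out = new ArrayList<>();
        for (Hit h : hits) {
            out.add(h.title);
        }
        return out;
    }

    static List<Hit> sampleHits() {
        return Arrays.asList(
                new Hit("matches A", 2.0, 2010),
                new Hit("matches B", 1.0, 2015));
    }

    public static void main(String[] args) {
        System.out.println(orderByScore(sampleHits()));          // [matches A, matches B]
        System.out.println(orderByYearDescending(sampleHits())); // [matches B, matches A]
    }
}
```

In the real application, the thing to check is whether the Pageable passed to setPageRequest carries a Sort, since that is where the extra ordering came from here.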

WildcardQuery not returning correct result

I have created an index using some data. Now I am using WildcardQuery to search this data. The indexed documents have a field named Product Code, against which I am searching.
Below is the code that I am using for creating the query and searching:
Term productCodeTerm = new Term("Product Code", "*"+searchText+"*");
query = new WildcardQuery(productCodeTerm);
searcher.search(query, 100);
The searchText variable contains the search string that is used to search the index. When searchText is 'jf', I get the following result:
JF32358
JF5215
JF2592
Now, when I search for 25, f2, f3, or anything other than j, f, or jf, the query has no hits.
I am not able to understand why it is happening. Can someone help me understand the reason the search is behaving in this way?

What analyzer did you use at indexing time? Given your examples, you should make sure that your analyzer:
does lowercasing,
does not remove digits,
does not split at boundaries between letters and digits.
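A WildcardQuery built directly from a Term is not analyzed, so the pattern is compared with the indexed terms exactly as they were produced at indexing time. The plain-Java sketch below imitates that comparison. The matcher and the two sample term lists are assumptions made for illustration; they are not Lucene code and not the asker's actual index contents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Imitates wildcard matching against indexed terms; not Lucene's implementation.
class WildcardTermSketch {

    // Translate '*' and '?' into regex; every other character is matched literally.
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    // Return the indexed terms that the whole pattern matches (case-sensitive).
    static List<String> matchingTerms(List<String> indexedTerms, String wildcard) {
        Pattern pattern = toRegex(wildcard);
        List<String> out = new ArrayList<>();
        for (String term : indexedTerms) {
            if (pattern.matcher(term).matches()) {
                out.add(term);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed output of an analyzer that lowercases and keeps codes whole.
        List<String> wholeCodes = Arrays.asList("jf32358", "jf5215", "jf2592");
        // Assumed output of an analyzer that keeps letters only.
        List<String> lettersOnly = Arrays.asList("jf");

        System.out.println(matchingTerms(wholeCodes, "*f2*"));  // [jf2592]
        System.out.println(matchingTerms(wholeCodes, "*JF*"));  // [] (case differs)
        System.out.println(matchingTerms(lettersOnly, "*jf*")); // [jf]
        System.out.println(matchingTerms(lettersOnly, "*25*")); // [] (digits were dropped)
    }
}
```

The last two lines reproduce the reported symptom: with a letters-only index, jf matches but 25 cannot. Lowercasing searchText before building the Term also matters when the analyzer lowercases.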

The Lucene FAQ says:
Leading wildcards (e.g. *ook) are not supported by the QueryParser by
default. As of Lucene 2.1, they can be enabled by calling
QueryParser.setAllowLeadingWildcard( true ). Note that this can be an
expensive operation: it requires scanning the list of tokens in the
index in its entirety to look for those that match the pattern.
For more information check here.

Identify existence of keywords in document from list

I want to create a tag list for a Lucene document based on a pre-determined list.
So, if we have a document with the text
Looking for a Java programmer with experience in Lucene
and we have the keyword list (about 1000 items)
java, php, lucene, c# [...]
I want to identify that the keywords Java and Lucene exist in the document.
Just doing a java OR php OR lucene will not work because then I will not know which keyword generated the hit.
Any suggestions on how to implement this in Lucene?

I assume that you have one or more indexed fields, and you want to build your tag cloud based on the intersection of your keywords and the indexed terms for a document.
Your problem is very similar to highlighting, so the same ideas apply. You can either:
re-analyze the stored fields of your Lucene document,
use term vectors for fast access to the terms of your documents' fields.
Note that if you want to use term vectors, you need to enable them at indexing time (see Field.TermVector.YES documentation and Field constructor).
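The first option (re-analyzing the stored text) comes down to intersecting the document's terms with the keyword list. Here is a minimal plain-Java sketch of that idea. The tokenizer is a simple assumption (lowercase, split on whitespace and common punctuation) and should be replaced by whatever analyzer the index actually uses; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Finds which keywords from a fixed list occur in a document's text.
class KeywordTaggerSketch {

    // Assumed tokenizer: lowercase, then split on whitespace and common punctuation.
    // '#' is deliberately not a delimiter, so a keyword such as "c#" survives.
    static Set<String> tokenize(String text) {
        Set<String> terms = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[\\s,.;:!?()]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Returns the keywords found in the text, in keyword-list order.
    static List<String> tags(String text, List<String> keywords) {
        Set<String> terms = tokenize(text);
        List<String> found = new ArrayList<>();
        for (String keyword : keywords) {
            if (terms.contains(keyword.toLowerCase(Locale.ROOT))) {
                found.add(keyword);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Looking for a Java programmer with experience in Lucene";
        List<String> keywords = Arrays.asList("java", "php", "lucene", "c#");
        System.out.println(tags(text, keywords)); // [java, lucene]
    }
}
```

With about 1000 keywords this is one set lookup per keyword per document. Multi-word keywords would need phrase handling, which this sketch leaves out.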

Yes, this works
FullTextSession fts = Search.getFullTextSession(getSessionFactory().getCurrentSession());
// Look up the Lucene document id of the entity with the given id
Query q = fts.getSearchFactory().buildQueryBuilder()
        .forEntity(Offer.class).get()
        .keyword()
        .onField("id")
        .matching(myId)
        .createQuery();
Object[] dId = (Object[]) fts.createFullTextQuery(q, Offer.class)
        .setProjection(ProjectionConstants.DOCUMENT_ID)
        .uniqueResult();
if (dId != null) {
    IndexReader indexReader = fts.getSearchFactory().getIndexReaderAccessor().open(Offer.class);
    // Term vector of the "description" field for that document
    TermFreqVector freq = indexReader.getTermFreqVector((Integer) dId[0], "description");
}
You have to remember to index the field with TermVector.YES in your hibernate search annotation for the field.

Lucene searching by numeric values

I'm building a Java Lucene-based search system that, on addition, adds a certain number of meta-fields, one of which is a sourceId field, which denotes where the entry came from.
I'm now trying to retrieve all documents from a particular source, but the index doesn't appear to be able to find them. However, if I search for a wildcard value, the returned documents all have the correct value for this field.
The Lucene query I'm using is quite simple, basically index-source-id:1, but it fails to return any hits. If I search for content:a* I get dozens of documents, all of which, when asked, return the value 1 for index-source-id, which is correct.
Any ideas?

I have only worked with the PHP port. However, have you checked which text analyzer you are using? This FAQ seems to indicate that, as in the PHP version, you need to use a different one that doesn't remove digits.
You can find a list of analyzers here
Just to be sure: have you set the id field to be indexed?
