Lucene: queries and docs with multiple fields

I have a collection of documents consisting of several fields, and I need to perform queries with several terms coming from multiple fields.
Which would you suggest I use: MultiFieldQueryParser or MultiPhraseQuery?
Thanks.

How about BooleanQuery?
http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/BooleanQuery.html

Choice of Analyzer
First of all, be careful about which analyzer you use. I was stumped for a while, only to realise that the StandardAnalyzer filters out common words like 'the' and 'a'. This is a problem when your field has the value 'A'. You might want to consider the KeywordAnalyzer:
See this post around the analyzer.
// Create an analyzer:
// NOTE: We want the keyword analyzer so that it doesn't strip or alter any terms:
// In our example, the Standard Analyzer removes the term 'A' because it is a common English word.
// https://stackoverflow.com/a/9071806/231860
KeywordAnalyzer analyzer = new KeywordAnalyzer();
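To make the difference concrete, here is a minimal plain-Java sketch of the two behaviours described above. It does not use Lucene at all: the class name, the tiny stop-word list and both methods are assumptions made up for illustration, not Lucene's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StopWordSketch {
    // Hypothetical stop-word list for illustration; not Lucene's actual set.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "the");

    // "Standard"-style analysis: split into words, lowercase, drop stop words.
    static List<String> standardLike(String value) {
        List<String> tokens = new ArrayList<>();
        for (String raw : value.split("[^\\p{L}\\p{N}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // "Keyword"-style analysis: the whole value is kept as one unmodified token.
    static List<String> keywordLike(String value) {
        return List.of(value);
    }

    public static void main(String[] args) {
        System.out.println(standardLike("A")); // prints [] : nothing left to match on
        System.out.println(keywordLike("A"));  // prints [A]
    }
}
```

With the stop-word filter the value 'A' produces no terms at all, so no query can ever match it; the keyword-style analysis keeps it searchable.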
Query Parser
Next, you can either create your query using the QueryParser:
See this post around overriding the default operator.
// Create a query parser without a default field in this example (the first argument):
QueryParser queryParser = new QueryParser("", analyzer);
// Optionally, set the default operator to be AND (we leave it the default OR):
// https://stackoverflow.com/a/9084178/231860
// queryParser.setDefaultOperator(QueryParser.Operator.AND);
// Parse the query:
Query multiTermQuery = queryParser.parse("field_name1:\"field value 1\" AND field_name2:\"field value 2\"");
Query API
Or you can achieve the same by constructing the query yourself using their API:
See this tutorial around creating the BooleanQuery.
BooleanQuery multiTermQuery = new BooleanQuery();
multiTermQuery.add(new TermQuery(new Term("field_name1", "field value 1")), BooleanClause.Occur.MUST);
multiTermQuery.add(new TermQuery(new Term("field_name2", "field value 2")), BooleanClause.Occur.MUST);
Delete the Documents that Match the Query
Then we finally pass the query to the writer to delete documents that match the query:
See my answer here, related to this answer.
See the answer to this question.
// Remove the document by using a multi key query:
// http://www.avajava.com/tutorials/lessons/how-do-i-combine-queries-with-a-boolean-query.html
writer.deleteDocuments(multiTermQuery);

Related

How can I get the highlights of my result set in Hibernate search 6?

I am using the Hibernate Search 6 Lucene backend in my Java application.
There are various search operations I am performing including a fuzzy search.
I get search results without any issues.
Now I want to show why each result in my result list was picked.
Let's say the search keyword is "test", and the fuzzy search is performed in the fields "name", "description", "Id" etc. And I get 10 results in a List. Now I want to highlight the values in the fields of each result which caused that result to be a matching result.
eg: Consider the below to be one of the items in the search result List object. (for clarity I have written it in JSON format)
{
name:"ABC some test name",
description: "this is a test element",
id: "abc123"
}
As the result suggests it's been picked as a search result because the keyword "test" is there in both the fields "name" and the "description". I want to highlight those specific fields in the frontend when I show the search results.
Currently, I am retrieving search results through a java REST API to my Angular frontend. How can I get those specific fields and their values using Hibernate search 6 in my java application?
So far I have gone through the Hibernate Search 6 documentation and found nothing. (https://docs.jboss.org/hibernate/stable/search/reference/en-US/html_single/#preface) I also looked at what seemed to be related issues on the web over the past week and got nothing so far. It seems like my requirement is a little specific and that's why I need your help here.

Highlighting is not yet implemented in Hibernate Search, see HSEARCH-2192.
That being said, you can leverage native Elasticsearch / Lucene APIs.
With Elasticsearch it's relatively easy: you can use a request transformer to add a highlight element to the HTTP request, then use the jsonHit projection to retrieve the JSON for each hit, which contains a highlight element that includes the highlighted fields and the highlighted fragments.
With Lucene it would be more complex and you'll have to rely on unsupported features, but that's doable.
Retrieve the Lucene Query from your Hibernate Search predicate:
SearchPredicate predicate = ...;
Query query = LuceneMigrationUtils.toLuceneQuery(predicate);
Then do the highlighting. "Hibernate search highlighting not analyzed fields" may help with that, though that code uses an older version of Lucene and you might have to adapt it:
String highlightText(Query query, Analyzer analyzer, String fieldName, String text) {
QueryScorer queryScorer = new QueryScorer(query);
SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("<span>", "</span>");
Highlighter highlighter = new Highlighter(formatter, queryScorer);
return highlighter.getBestFragment(analyzer, fieldName, text);
}
You'll need to add a dependency to org.apache.lucene:lucene-highlighter.
To retrieve the analyzer, use the Hibernate Search metadata: https://docs.jboss.org/hibernate/stable/search/reference/en-US/html_single/#backend-lucene-access-analyzers
So, connecting the dots... something like that?
Highlighter createHighlighter(SearchPredicate predicate, SearchScope<?> scope) {
// Taking a shortcut here to retrieve the index manager,
// since we already have the scope
// WARNING: This only works when searching a single index
Analyzer analyzer = scope.includedTypes().iterator().next().indexManager()
.unwrap( LuceneIndexManager.class )
.searchAnalyzer();
// WARNING: this method is not supported and might disappear in future versions of HSearch
Query query = LuceneMigrationUtils.toLuceneQuery(predicate);
QueryScorer queryScorer = new QueryScorer(query);
SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("<span>", "</span>");
return new Highlighter(formatter, queryScorer);
}
SearchSession searchSession = Search.session( entityManager );
SearchScope<Book> scope = searchSession.scope( Book.class );
SearchPredicate predicate = scope.predicate().match()
.fields( "title", "authors.name" )
.matching( "refactoring" )
.toPredicate();
Highlighter highlighter = createHighlighter(predicate, scope);
// Using Pair from Apache Commons, but others would work just as well
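List<Pair<Book, String>> hits = searchSession.search( scope )
.select( f -> f.composite(
// Highlighting the title only, but you can do the same for other fields
// NOTE: "analyzer" must be in scope here (retrieve it as in createHighlighter),
// and getBestFragment throws checked exceptions that must be handled in the lambda
book -> Pair.of( book, highlighter.getBestFragment(analyzer, "title", book.getTitle())),
f.entity()
) )
.where( predicate )
.fetchHits( 20 );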
Not sure this compiles, but that should get you started.
Relatedly, but not exactly what you're asking for, there's an explain feature to get a sense of why a given hit has a given score: https://docs.jboss.org/hibernate/stable/search/reference/en-US/html_single/#search-dsl-query-explain
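To make the expected result concrete, here is a plain-Java sketch of what the highlighted fragments from the question's example could look like. It is only an illustration: it does exact, case-insensitive substring matching with java.util.regex, not Lucene's Highlighter, and it knows nothing about analyzers or fuzzy matching. The class and method names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HighlightSketch {
    // Wraps every case-insensitive occurrence of the keyword in <span> tags.
    // Returns the text unchanged when the keyword does not occur.
    static String highlight(String text, String keyword) {
        Pattern pattern = Pattern.compile(Pattern.quote(keyword), Pattern.CASE_INSENSITIVE);
        return pattern.matcher(text)
                .replaceAll(m -> "<span>" + Matcher.quoteReplacement(m.group()) + "</span>");
    }

    public static void main(String[] args) {
        System.out.println(highlight("ABC some test name", "test"));
        // prints: ABC some <span>test</span> name
        System.out.println(highlight("abc123", "test"));
        // prints: abc123
    }
}
```

A REST API could return such a string per field next to the entity, so the frontend knows which fields matched and where.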

Hibernate search on prefixes

Right now, I have successfully configured a basic Hibernate Search index to be able to search for full words on various fields of my JPA entity:
@Entity
@Indexed
class Talk {
@Field String title
@Field String summary
}
And my query looks something like this:
List<Talk> search(String text) {
FullTextEntityManager fullTextEntityManager = Search.getFullTextEntityManager(entityManager)
QueryBuilder queryBuilder = fullTextEntityManager.getSearchFactory().buildQueryBuilder().forEntity(Talk).get()
Query query = queryBuilder
.keyword()
.onFields("title", "summary")
.matching(text)
.createQuery()
FullTextQuery jpaQuery = fullTextEntityManager.createFullTextQuery(query, Talk)
return jpaQuery.getResultList()
}
Now I would like to fine-tune this setup so that when I search for "test" it still finds talks where title or summary contains "test" even as the prefix of another word. So talks titled "unit testing", or whose summary contains "testicle" should still appear in the search results, not just talks whose title or summary contains "test" as a full word.
I've tried to look at the documentation, but I can't figure out if I should change something to the way my entity is indexed, or whether it has something to do with the query. Note that I wanted to do something like the following, but then it's hard to search on several fields:
Query query = queryBuilder
.keyword().wildcard()
.onField("title")
.matching(text + "*")
.createQuery()
EDIT:
Based on Hardy's answer, I configured my entity like so:
@Indexed
@Entity
@AnalyzerDefs([
@AnalyzerDef(name = "ngram",
tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
filters = [
@TokenFilterDef(factory = LowerCaseFilterFactory.class),
@TokenFilterDef(factory = NGramFilterFactory.class,
params = [
@Parameter(name = "minGramSize",value = "3"),
@Parameter(name = "maxGramSize",value = "3")
])
])
])
class Talk {
@Field(analyzer=@Analyzer(definition="ngram")) String title
@Field(analyzer=@Analyzer(definition="ngram")) String summary
}
Thanks to that configuration, when I search for 'arti', I get Talks whose title or summary contains words that 'arti' is a subword of (artist, artisanal, etc.). Unfortunately, after those I also get Talks whose title or summary contains words that contain subwords of my search term (arts, fart, etc.). There's probably some fine-tuning to eliminate those, but at least I get results sooner now, and they are in a sensible order.
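The extra matches can be reproduced with a small plain-Java sketch. It assumes, as a simplification, that both the indexed words and the search term are split into 3-grams and that sharing a single gram is enough to match; the class and method names are made up and no Lucene code is involved.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

final class TrigramSketch {
    // All substrings of the given size, in order of appearance.
    static Set<String> grams(String word, int size) {
        Set<String> result = new LinkedHashSet<>();
        String w = word.toLowerCase(Locale.ROOT);
        for (int i = 0; i + size <= w.length(); i++) {
            result.add(w.substring(i, i + size));
        }
        return result;
    }

    // True if the query and the indexed word have at least one gram in common.
    static boolean sharesGram(String query, String indexed, int size) {
        Set<String> common = grams(query, size);
        common.retainAll(grams(indexed, size));
        return !common.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(grams("arti", 3));                // prints [art, rti]
        System.out.println(sharesGram("arti", "artist", 3)); // prints true
        System.out.println(sharesGram("arti", "fart", 3));   // prints true (shared gram "art")
        System.out.println(sharesGram("arti", "table", 3));  // prints false
    }
}
```

Because the search term itself is broken into grams, "arts" and "fart" match through the shared gram "art", even though neither contains "arti".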

There are multiple things you can do here. A lot can be done via the proper analyzing during index time.
For example, you want to apply a stemmer appropriate for your language. For English this is generally the Snowball stemmer. The idea is that during indexing all words are reduced to their stem, e.g. "testing" and "tested" to "test". This gets you part of the way.
The other thing you can look into is n-gram indexing. According to your description you also want to find matches in unrelated words. The idea here is to index "subwords" of each word, so that they can be found later.
Regarding analyzers, you want to look at the named analyzers section of the Hibernate Search docs. The key here is the @AnalyzerDef annotation.
On the query side you can also apply some "tricks". Indeed you can use wildcard queries, however, if you are using the Hibernate Search query DSL, you cannot use a keyword query, but you need to use a wildcard query. Again, check the Hibernate Search docs.

You should use an NGram or EdgeNGram filter for indexing, as you correctly noted. But you should use a different analyzer for your queries, as suggested in the Elasticsearch documentation (see search_analyzer):
https://www.elastic.co/guide/en/elasticsearch/guide/current/_index_time_search_as_you_type.html
This way your search query wouldn't be tokenized to ngrams and your results would be more like %text% or text% in SQL.
Unfortunately, for unknown reasons, Hibernate Search currently doesn't support specifying a search_analyzer on fields. You can only specify an analyzer for indexing, which is then also used for search query analysis.
I plan to implement this functionality myself.
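The effect of analyzing the query differently from the index can be sketched in plain Java. This is an illustration under simplifying assumptions, not Lucene or Elasticsearch code: the index side stores edge n-grams (prefixes) of each word, the query side is only lowercased, and all names and gram sizes are made up.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

final class EdgeNGramSketch {
    // Prefixes of the word with lengths from min to max (capped at the word length).
    static Set<String> edgeGrams(String word, int min, int max) {
        Set<String> result = new LinkedHashSet<>();
        String w = word.toLowerCase(Locale.ROOT);
        for (int len = min; len <= Math.min(max, w.length()); len++) {
            result.add(w.substring(0, len));
        }
        return result;
    }

    // Index side: every whitespace-separated word contributes its edge n-grams.
    static Set<String> indexTerms(String text, int min, int max) {
        Set<String> terms = new LinkedHashSet<>();
        for (String word : text.split("\\s+")) {
            if (!word.isEmpty()) {
                terms.addAll(edgeGrams(word, min, max));
            }
        }
        return terms;
    }

    // Query side: the search term is lowercased but never split into grams.
    static boolean matches(String query, String indexedText) {
        return indexTerms(indexedText, 1, 20).contains(query.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(matches("test", "unit testing")); // prints true
        System.out.println(matches("arti", "artist"));       // prints true
        System.out.println(matches("arti", "fart"));         // prints false
    }
}
```

Since the query term stays whole, it only matches words that start with it, which corresponds to text% in SQL; plain (non-edge) n-grams on the index side would give the %text% behaviour.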
EDIT:
You can specify search-time analyzer (search_analyzer) like this:
List<Talk> search(String text) {
FullTextEntityManager fullTextEntityManager = Search.getFullTextEntityManager(entityManager)
EntityContext entityContext = fullTextEntityManager.getSearchFactory().buildQueryBuilder().forEntity(Talk)
// Use the named analyzer definition for this field at query time only
entityContext.overridesForField("myField", "myNamedAnalyzerDef")
QueryBuilder queryBuilder = entityContext.get()
Query query = queryBuilder
.keyword()
.onFields("title", "summary")
.matching(text)
.createQuery()
FullTextQuery jpaQuery = fullTextEntityManager.createFullTextQuery(query, Talk)
return jpaQuery.getResultList()
}
I have used this technique to effectively simulate Lucene search_analyzer property.

In Lucene version 4.9 I used the EnglishAnalyzer for this. I think it is an English-only implementation of the SnowballAnalyzer, but I'm not 100% certain. I used it for both creating and searching the indexes. There is nothing special needed to use it.
Analyzer analyzer = new EnglishAnalyzer(Version.LUCENE_4_9);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_9, analyzer);
and
analyzer = new EnglishAnalyzer(Version.LUCENE_4_9);
parser = new StandardQueryParser(analyzer);
You can see it in action at Guided Code Search. This runs exclusively off Lucene.
Lucene can be integrated into Hibernate searches, but I haven't yet tried to do that myself. It seems like it would be powerful, but I don't know: see Apache Lucene™ Integration.
I've also read that lucene can be patched into SQL engines, but I haven't tried that either. Example: Indexing Databases with Lucene.

Java/Lucene Search multiple fields for a substring

I'm using Lucene V3.1 and Java 1.6.
I'm trying to write code (using Java and Lucene) that allows me to do a multi-field phrase search. However, I don't want the phrase to exactly match the value in the field. What I want is to check whether the phrase is a substring of the value in the field. I tried the following, but no luck yet:
IndexReader reader = IndexReader.Open("<lucene dir>");
Searcher searcher = new IndexSearcher(reader);
BooleanQuery booleanQuery = new BooleanQuery();
Query query1 = new TermQuery(new Term("<field-name>", "<text>"));
booleanQuery.add(query1, BooleanClause.Occur.SHOULD);
Hits hits = searcher.Search(booleanQuery);

Just use quotes, like "this is the substring". This works with the Lucene QueryParser.
If you are building the Query programmatically, use a PhraseQuery. See also http://lucene.apache.org/core/old_versioned_docs/versions/3_1_0/api/core/org/apache/lucene/search/PhraseQuery.html

Which analyzer did you use while indexing?
If you used the StandardAnalyzer, you should not face a problem like this.
PS: always use the same analyzer for both indexing and searching.
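Why a mismatch between the two analyzers hurts can be shown with a toy model in plain Java. The two "analyzers" below are just string functions invented for illustration (one lowercases, one leaves the text as is); this is not Lucene code.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

final class AnalyzerMismatchSketch {
    // Hypothetical stand-ins for real analyzers.
    static final UnaryOperator<String> LOWERCASING = s -> s.toLowerCase(Locale.ROOT);
    static final UnaryOperator<String> AS_IS = s -> s;

    // Indexes the words of the value with one analyzer, then looks up the
    // query text after running it through the other analyzer.
    static boolean found(String indexedValue, UnaryOperator<String> indexAnalyzer,
                         String queryText, UnaryOperator<String> queryAnalyzer) {
        Set<String> indexTerms = new HashSet<>();
        for (String word : indexedValue.split("\\s+")) {
            indexTerms.add(indexAnalyzer.apply(word));
        }
        return indexTerms.contains(queryAnalyzer.apply(queryText));
    }

    public static void main(String[] args) {
        // Same analysis on both sides: the term is found.
        System.out.println(found("Harry Potter", LOWERCASING, "Potter", LOWERCASING)); // prints true
        // Index lowercased, query left as is: "Potter" is not among the index terms.
        System.out.println(found("Harry Potter", LOWERCASING, "Potter", AS_IS));       // prints false
    }
}
```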

How do I use boolean operators with Hibernate Search

I'm learning the Hibernate Search Query DSL, and I'm not sure how to construct queries using boolean arguments such as AND or OR.
For example, let's say that I want to return all person records that have a firstName value of "bill" or "bob".
Following the hibernate docs, one example uses the bool() method w/ two subqueries, such as:
QueryBuilder b = fts.getSearchFactory().buildQueryBuilder().forEntity(Person.class).get();
Query luceneQuery = b.bool()
.should(b.keyword().onField("firstName").matching("bill").createQuery())
.should(b.keyword().onField("firstName").matching("bob").createQuery())
.createQuery();
logger.debug("query 1:{}", luceneQuery.toString());
This ultimately produces the lucene query that I want, but is this the proper way to use boolean logic with hibernate search? Is "should()" the equivalent of "OR" (similarly, does "must()" correspond to "AND")?.
Also, writing a query this way feels cumbersome. For example, what if I had a collection of firstNames to match against? Is this type of query a good match for the DSL in the first place?

Yes your example is correct. The boolean operators are called should instead of OR because of the names they have in the Lucene API and documentation, and because it is more appropriate: it is not only influencing a boolean decision, but it also affects scoring of the result.
For example, if you search for cars "of brand Fiat" OR "blue", the cars that are both branded Fiat AND blue will also be returned, with a higher score than those which are blue but not Fiat.
It might feel cumbersome because it's programmatic and provides many detailed options. A simpler alternative is to use a simple string for your query and use the QueryParser to create the query. Generally the parser is useful to parse user input, the programmatic one is easier to deal with well defined fields; for example if you have the collection you mentioned it's easy to build it in a for loop.
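For the string-based alternative, a query string for a collection of values could be assembled roughly as follows. This is a sketch only: the class and method names are made up, the OR keyword and field:"value" form follow the classic Lucene query syntax, and the escaping covers only backslashes and double quotes inside the quoted value.

```java
import java.util.List;
import java.util.stream.Collectors;

final class NameQuerySketch {
    // Builds e.g. firstName:"bill" OR firstName:"bob" for use with a query parser.
    static String anyOf(String field, List<String> values) {
        return values.stream()
                .map(v -> field + ":\"" + v.replace("\\", "\\\\").replace("\"", "\\\"") + "\"")
                .collect(Collectors.joining(" OR "));
    }

    public static void main(String[] args) {
        System.out.println(anyOf("firstName", List.of("bill", "bob")));
        // prints: firstName:"bill" OR firstName:"bob"
    }
}
```

The resulting string would then be handed to a QueryParser; for values that come from well-defined fields rather than user input, the programmatic loop is usually the safer choice because no escaping is needed.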

You can also use BooleanQuery. I would prefer this because you can use it in a loop over a list.
org.hibernate.search.FullTextQuery hibque = null;
org.apache.lucene.search.BooleanQuery bquery = new BooleanQuery();
QueryBuilder qb = fulltextsession.getSearchFactory().buildQueryBuilder()
.forEntity(entity.getClass()).get();
for (String keyword : list) {
bquery.add(qb.keyword().wildcard().onField(entityColumn).matching(keyword)
.createQuery() , BooleanClause.Occur.SHOULD);
}
if (!filterColumn.equals("") && !filterValue.equals("")) {
bquery.add(qb.keyword().wildcard().onField(filterColumn).matching(filterValue).createQuery()
, BooleanClause.Occur.MUST);
}
hibque = fulltextsession.createFullTextQuery(bquery, entity.getClass());
int num = hibque.getResultSize();

To answer your secondary question:
For example, what if I had a collection of firstNames to match against?
I'm not an expert, but according to (the third example from the end of) 5.1.2.1. Keyword queries in Hibernate Search Documentation, you should be able to build the query like so:
Collection<String> namesCollection = getNames(); // Contains "billy" and "bob", for example
StringBuilder names = new StringBuilder(100);
for(String name : namesCollection) {
names.append(name).append(" "); // Never mind the space at the end of the resulting string.
}
QueryBuilder b = fts.getSearchFactory().buildQueryBuilder().forEntity(Person.class).get();
Query luceneQuery = b.bool()
.should(
// Searches for multiple possible values in the same field
b.keyword().onField("firstName").matching( names.toString() ).createQuery()
)
.must(b.keyword().onField("lastName").matching("thornton").createQuery())
.createQuery();
and, have as a result, Persons with (firstName preferably "billy" or "bob") AND (lastName = "thornton"), although I don't think it will give the good ol' Billy Bob Thornton a higher score ;-).

I was looking into the same topic and had a somewhat different issue than the one presented. I was looking for an actual OR junction. The should case didn't work for me, as results that didn't pass either of the two expressions were still returned, only with a lower score. I wanted to completely omit these results. You can however create an actual boolean OR expression, using a separate boolean expression for which you disable scoring:
val booleanQuery = cb.bool();
val packSizeSubQuery = cb.bool();
packSizes.stream().map(packSize -> cb.phrase()
.onField(LUCENE_FIELD_PACK_SIZES)
.sentence(packSize.name())
.createQuery())
.forEach(packSizeSubQuery::should);
booleanQuery.must(packSizeSubQuery.createQuery()).disableScoring();
val persistenceQuery = fullTextEntityManager.createFullTextQuery(booleanQuery.createQuery(), Product.class);
return persistenceQuery.getResultList();
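Why nesting helps can be shown with a simplified plain-Java model of the matching rule. It assumes the usual Lucene behaviour that should clauses are only mandatory (at least one must match) when the boolean query has no must clause; scoring, must-not clauses and minimum-should-match settings are ignored, and all names are made up.

```java
import java.util.List;
import java.util.function.Predicate;

final class BoolSemanticsSketch {
    // Simplified matching rule: all MUST clauses have to match; SHOULD clauses
    // are required (at least one) only when there is no MUST clause.
    static boolean matches(String doc, List<Predicate<String>> must, List<Predicate<String>> should) {
        boolean allMust = must.stream().allMatch(p -> p.test(doc));
        boolean anyShould = should.stream().anyMatch(p -> p.test(doc));
        if (!allMust) {
            return false;
        }
        return must.isEmpty() ? anyShould : true;
    }

    public static void main(String[] args) {
        Predicate<String> isProduct = d -> d.startsWith("product");
        List<Predicate<String>> sizes = List.of(d -> d.contains("small"), d -> d.contains("large"));

        // Flat query: the size clauses are optional next to the MUST clause.
        System.out.println(matches("product medium", List.of(isProduct), sizes)); // prints true

        // Nested query: the size clauses form their own boolean query, which is itself a MUST.
        Predicate<String> anySize = d -> matches(d, List.of(), sizes);
        System.out.println(matches("product medium", List.of(isProduct, anySize), List.of())); // prints false
        System.out.println(matches("product large", List.of(isProduct, anySize), List.of()));  // prints true
    }
}
```

In the nested form a document has to satisfy at least one of the size clauses, which is the strict OR that the answer above builds with a separate bool() wrapped in must().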

How to do a Multi field - Phrase search in Lucene?

Title asks it all... I want to do a multi field - phrase search in Lucene.. How to do it ?
for example :
I have fields as String s[] = {"title","author","content"};
I want to search harry potter across all fields.. How do I do it ?
Can someone please provide an example snippet ?

Use MultiFieldQueryParser; it's a QueryParser which constructs queries to search multiple fields.
Another way is to create a BooleanQuery consisting of TermQuery clauses (in your case, phrase queries).
Third way is to include the content of other fields into your default content field.
Add
Generally speaking, querying on multiple fields isn’t the best practice for user-entered queries. More commonly, all words you want searched are indexed into a contents or keywords field by combining various fields.
Update
Usage:
Query query = MultiFieldQueryParser.parse(Version.LUCENE_30, new String[] {"harry potter","harry potter","harry potter"}, new String[] {"title","author","content"},new SimpleAnalyzer());
IndexSearcher searcher = new IndexSearcher(...);
// The Hits class was removed in Lucene 3.0; fetch the top 10 matches as TopDocs instead
TopDocs hits = searcher.search(query, 10);
The MultiFieldQueryParser will resolve the query in this way: (See javadoc)
Parses a query which searches on the fields specified. If x fields are specified, this effectively constructs:
(field1:query1) (field2:query2) (field3:query3)...(fieldx:queryx)
Hope this helps.

Intensified googling revealed this:
http://lucene.472066.n3.nabble.com/Phrase-query-on-multiple-fields-td2292312.html.
Since it is latest and best, I'll go with his approach I guess.. Nevertheless, it might help someone who is looking for something like I am...

You need to use MultiFieldQueryParser with escaped string. I have tested it with Lucene 8.8.1 and it's working like magic.
String queryStr = "harry potter";
queryStr = "\"" + queryStr.trim() + "\"";
Query query = new MultiFieldQueryParser(new String[]{"title","author","content"}, new StandardAnalyzer()).parse(queryStr);
System.out.println(query);
It prints:
(title:"harry potter") (author:"harry potter") (content:"harry potter")
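The shape of that output can be reproduced with a few lines of plain Java, which may help when reasoning about what the parser builds. This sketch only mimics the printed form shown above by string concatenation; it is not how MultiFieldQueryParser works internally, and the class and method names are made up.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

final class MultiFieldPhraseSketch {
    // Wraps the phrase in quotes and repeats it once per field.
    static String expand(String phrase, String... fields) {
        String quoted = "\"" + phrase.trim() + "\"";
        return Arrays.stream(fields)
                .map(field -> "(" + field + ":" + quoted + ")")
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(expand("harry potter", "title", "author", "content"));
        // prints: (title:"harry potter") (author:"harry potter") (content:"harry potter")
    }
}
```

The clauses are separated by spaces only, so with the parser's default OR operator a document matches if the phrase occurs in any one of the fields.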
