How to parse a search string without a field limit in Lucene (Java)

For example:
title:lucene+((author:jack)^300.0 (bookname:how to use lucene)^200.0 (price:[100 TO 200])^100.0)~1
Is there any way in Lucene to parse a query string into a Query object, something like Query query = Function(String queryString)?

You can use the classic QueryParser to build a parser:
QueryParser parser = new QueryParser(default_field_name, analyzer);
If you provide field names in your query string (as you do in your example), then the default field name is not used.
The analyzer should typically be the same as the analyzer which was used to build the index. For example, the StandardAnalyzer:
Analyzer analyzer = new StandardAnalyzer();
And then you can use the string containing your query as follows:
Query query = parser.parse(your_query_string);
The demo code provided as part of Lucene shows an example of this approach. See lines 118 and 135 in the SearchFiles.html code.
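Putting those pieces together, here is a minimal sketch of such a "Query query = Function(String queryString)" helper, assuming a recent Lucene where the classic QueryParser ships in the lucene-queryparser module; the fallback field name "contents" is only illustrative:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

// Parses an arbitrary query string; any field prefixes inside the string
// (title:, author:, price:, ...) take precedence over the fallback field.
static Query toQuery(String queryString) throws ParseException {
    Analyzer analyzer = new StandardAnalyzer();
    QueryParser parser = new QueryParser("contents", analyzer);
    return parser.parse(queryString);
}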

Related

Hibernate search on prefixes

Right now, I have successfully configured a basic Hibernate Search index to be able to search for full words on various fields of my JPA entity:
@Entity
@Indexed
class Talk {
    @Field String title
    @Field String summary
}
And my query looks something like this:
List<Talk> search(String text) {
    FullTextEntityManager fullTextEntityManager = Search.getFullTextEntityManager(entityManager)
    QueryBuilder queryBuilder = fullTextEntityManager.getSearchFactory().buildQueryBuilder().forEntity(Talk).get()
    Query query = queryBuilder
        .keyword()
        .onFields("title", "summary")
        .matching(text)
        .createQuery()
    FullTextQuery jpaQuery = fullTextEntityManager.createFullTextQuery(query, Talk)
    return jpaQuery.getResultList()
}
Now I would like to fine-tune this setup so that when I search for "test" it still finds talks where title or summary contains "test" even as the prefix of another word. So talks titled "unit testing", or whose summary contains "testicle" should still appear in the search results, not just talks whose title or summary contains "test" as a full word.
I've tried to look at the documentation, but I can't figure out whether I should change something about the way my entity is indexed, or whether it has something to do with the query. Note that I wanted to do something like the following, but then it's hard to search on several fields:
Query query = queryBuilder
    .keyword().wildcard()
    .onField("title")
    .matching(text + "*")
    .createQuery()
EDIT:
Based on Hardy's answer, I configured my entity like so:
@Indexed
@Entity
@AnalyzerDefs([
    @AnalyzerDef(name = "ngram",
        tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
        filters = [
            @TokenFilterDef(factory = LowerCaseFilterFactory.class),
            @TokenFilterDef(factory = NGramFilterFactory.class,
                params = [
                    @Parameter(name = "minGramSize", value = "3"),
                    @Parameter(name = "maxGramSize", value = "3")
                ])
        ])
])
class Talk {
    @Field(analyzer = @Analyzer(definition = "ngram")) String title
    @Field(analyzer = @Analyzer(definition = "ngram")) String summary
}
Thanks to that configuration, when I search for 'arti', I get Talks whose title or summary contains words in which 'arti' is a substring (artist, artisanal, etc.). Unfortunately, after those I also get Talks whose title or summary contains words that merely share subwords with my search term (arts, fart, etc.). There's probably some fine-tuning to eliminate those, but at least I get results sooner now, and they are in a sensible order.
There are multiple things you can do here. A lot can be done via proper analysis at index time.
For example, you can apply a stemmer appropriate for your language; for English this is generally the Snowball stemmer. The idea is that during indexing all words are reduced to their stem, so testing and tested both become test, for example. This gets you part of the way.
The other thing you can look into is n-gram indexing. According to your description you also want to find matches inside otherwise unrelated words. The idea here is to index "subwords" (n-grams) of each word so that they can be found later.
Regarding analyzers, have a look at the named analyzers section of the Hibernate Search docs. The key here is the @AnalyzerDef annotation.
On the query side you can also apply some "tricks". You can indeed use wildcard queries; however, with the Hibernate Search query DSL you cannot express them via a keyword query, you need to use a wildcard query instead. Again, check the Hibernate Search docs.
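As a minimal sketch, here is what such an analyzer definition could look like with the Snowball stemmer instead of n-grams, assuming Hibernate Search 5 style annotations and the Lucene SnowballPorterFilterFactory; the definition name "snowball" and the id field are only illustrative:
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.snowball.SnowballPorterFilterFactory;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
import org.hibernate.search.annotations.*;

@Entity
@Indexed
@AnalyzerDef(name = "snowball",
        tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
        filters = {
                @TokenFilterDef(factory = LowerCaseFilterFactory.class),
                // Reduce every token to its stem, so "testing" and "tested" are indexed as "test":
                @TokenFilterDef(factory = SnowballPorterFilterFactory.class,
                        params = @Parameter(name = "language", value = "English"))
        })
public class Talk {

    @Id @GeneratedValue Long id;

    @Field(analyzer = @Analyzer(definition = "snowball"))
    String title;

    @Field(analyzer = @Analyzer(definition = "snowball"))
    String summary;
}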
You should use an NGram or EdgeNGram filter for indexing, as you correctly noted in your answer. But you should use a different analyzer for your queries, as suggested in the Elasticsearch guide (see search_analyzer):
https://www.elastic.co/guide/en/elasticsearch/guide/current/_index_time_search_as_you_type.html
This way your search query isn't tokenized into n-grams, and your results behave more like %text% or text% in SQL.
Unfortunately, for unknown reasons, Hibernate Search currently doesn't support specifying a search_analyzer on fields. You can only specify the analyzer used for indexing, which is then also used to analyze search queries.
I plan to implement this functionality myself.
EDIT:
You can specify a search-time analyzer (search_analyzer) like this:
List<Talk> search(String text) {
    FullTextEntityManager fullTextEntityManager = Search.getFullTextEntityManager(entityManager)
    EntityContext entityContext = fullTextEntityManager.getSearchFactory().buildQueryBuilder().forEntity(Talk)
    entityContext.overridesForField("myField", "myNamedAnalyzerDef")
    QueryBuilder queryBuilder = entityContext.get()
    Query query = queryBuilder
        .keyword()
        .onFields("title", "summary")
        .matching(text)
        .createQuery()
    FullTextQuery jpaQuery = fullTextEntityManager.createFullTextQuery(query, Talk)
    return jpaQuery.getResultList()
}
I have used this technique to effectively simulate Lucene search_analyzer property.
In Lucene version 4.9 I used the EnglishAnalyzer for this. I think it is an English-only implementation of the SnowballAnalyzer, but I'm not 100% certain. I used it for both creating and searching the indexes. There is nothing special needed to use it.
Analyzer analyzer = new EnglishAnalyzer(Version.LUCENE_4_9);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_9, analyzer);
and
analyzer = new EnglishAnalyzer(Version.LUCENE_4_9);
parser = new StandardQueryParser(analyzer);
You can see it in action at Guided Code Search. This runs exclusively off Lucene.
Lucene can be integrated into Hibernate searches, but I haven't yet tried to do that myself. It seems like it would be powerful, but I don't know: see Apache Lucene Integration.
I've also read that Lucene can be integrated with SQL engines, but I haven't tried that either. Example: Indexing Databases with Lucene.

Match-no-documents Query object in Lucene

How do I create a new Query instance having the property of not matching any document, analogous to the MatchAllDocsQuery, but opposite?
A BooleanQuery without any clauses matches no documents.
Query blank = new BooleanQuery();
For newer versions of Lucene you can use the Builder with the same result.
Query blank = new BooleanQuery.Builder().build();
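A quick usage sketch, assuming you already have an open IndexSearcher named searcher (hypothetical):
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

// An empty BooleanQuery matches nothing, no matter what the index contains.
Query blank = new BooleanQuery.Builder().build();
TopDocs hits = searcher.search(blank, 10);
// hits.scoreDocs is always empty here.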

Java/Lucene Search multiple fields for a substring

I'm using Lucene v3.1 and Java 1.6.
I'm trying to write code (using Java and Lucene) that allows me to do a multi-field phrase search. However, I don't want the phrase to exactly match the value in the field. What I want is to check whether the phrase is actually a substring of the value in the field. I tried the following, but no luck yet:
IndexReader reader = IndexReader.open(FSDirectory.open(new File("<lucene dir>")));
Searcher searcher = new IndexSearcher(reader);
BooleanQuery booleanQuery = new BooleanQuery();
Query query1 = new TermQuery(new Term("<field-name>", "<text>"));
booleanQuery.add(query1, BooleanClause.Occur.SHOULD);
TopDocs hits = searcher.search(booleanQuery, 10);
Just use quotes? Like "this is the substring". This surely works with the Lucene QueryParser.
If you are building the Query programmatically, use a PhraseQuery. See also http://lucene.apache.org/core/old_versioned_docs/versions/3_1_0/api/core/org/apache/lucene/search/PhraseQuery.html
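A minimal sketch of building such a PhraseQuery directly, assuming the Lucene 3.x API; the field name "summary" is only illustrative:
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

// Build the phrase term by term, in order:
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("summary", "this"));
phrase.add(new Term("summary", "is"));
phrase.add(new Term("summary", "the"));
phrase.add(new Term("summary", "substring"));
// Slop 0 requires the terms to appear as an exact, contiguous phrase.
phrase.setSlop(0);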
Which analyzer did you use while indexing?
If you used the StandardAnalyzer, you should not face a problem like this...
PS: always use the same analyzer for both indexing and searching.

JAVA Lucene not giving search results on Field?

I am creating a Lucene Document like this:
Document document = new Document();
document.add(new Field(FIELD_FOLDER_PATH, mSearchInput, Field.Store.YES, Field.Index.NOT_ANALYZED));
Reader reader = new FileReader(file);
document.add(new Field(FIELD_CONTENTS, reader));
indexWriter.addDocument(document);
When executing a query on CONTENTS, using the wildcard character *, I am able to fetch results:
QueryParser queryParser = new QueryParser(Version.LUCENE_36, FIELD_CONTENTS, analyzer);
Query query = queryParser.parse(searchString+"*");
But when I use the same query for FIELD_FOLDER_PATH, I get no results:
QueryParser queryParser = new QueryParser(Version.LUCENE_36, FIELD_FOLDER_PATH, analyzer);
Query query = queryParser.parse(FolderPath+"*");
However, only when I provide the exact string am I able to fetch the results.
My question is: why am I not able to use (*) to fetch results on FIELD_FOLDER_PATH? Is it because of the way I am creating the field?
You should use a WildcardQuery to support this kind of feature.
This link should help:
http://lucene.apache.org/core/old_versioned_docs/versions/3_0_1/api/core/org/apache/lucene/search/WildcardQuery.html
So what you should do is create two queries, one built with the QueryParser and the other a WildcardQuery, and then combine them in a BooleanQuery with a SHOULD clause for each, as sketched below.
For details on BooleanQuery, see this link:
http://lucene.apache.org/core/old_versioned_docs/versions/3_0_2/api/core/org/apache/lucene/search/BooleanQuery.html
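A rough sketch of that combination, assuming the Lucene 3.6 API and the field names from the question; folderPath and searchString are illustrative variables:
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.util.Version;

// Analyzed contents field: let the QueryParser tokenize the input (parse throws ParseException).
QueryParser contentsParser = new QueryParser(Version.LUCENE_36, FIELD_CONTENTS, analyzer);
Query contentsQuery = contentsParser.parse(searchString + "*");

// NOT_ANALYZED path field: match the stored term directly with a trailing wildcard.
Query pathQuery = new WildcardQuery(new Term(FIELD_FOLDER_PATH, folderPath + "*"));

// Two SHOULD clauses behave like OR: either one is enough for a document to match.
BooleanQuery combined = new BooleanQuery();
combined.add(contentsQuery, BooleanClause.Occur.SHOULD);
combined.add(pathQuery, BooleanClause.Occur.SHOULD);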

Lucene: queries and docs with multiple fields

I have a collection of documents consisting of several fields, and I need to perform queries with several terms coming from multiple fields.
What do you suggest I use: MultiFieldQueryParser or MultiPhraseQuery?
thanks
How about BooleanQuery?
http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/BooleanQuery.html
Choice of Analyzer
First of all, watch out which analyzer you are using. I was stumped for a while only to realise that the StandardAnalyzer filters out common words like 'the' and 'a'. This is a problem when your field has the value 'A'. You might want to consider the KeywordAnalyzer:
See this post around the analyzer.
// Create an analyzer:
// NOTE: We want the keyword analyzer so that it doesn't strip or alter any terms:
// In our example, the Standard Analyzer removes the term 'A' because it is a common English word.
// https://stackoverflow.com/a/9071806/231860
KeywordAnalyzer analyzer = new KeywordAnalyzer();
Query Parser
Next, you can either create your query using the QueryParser:
See this post around overriding the default operator.
// Create a query parser without a default field in this example (the first argument):
QueryParser queryParser = new QueryParser("", analyzer);
// Optionally, set the default operator to be AND (we leave it the default OR):
// https://stackoverflow.com/a/9084178/231860
// queryParser.setDefaultOperator(QueryParser.Operator.AND);
// Parse the query:
Query multiTermQuery = queryParser.parse("field_name1:\"field value 1\" AND field_name2:\"field value 2\"");
Query API
Or you can achieve the same by constructing the query yourself using their API:
See this tutorial around creating the BooleanQuery.
BooleanQuery multiTermQuery = new BooleanQuery();
multiTermQuery.add(new TermQuery(new Term("field_name1", "field value 1")), BooleanClause.Occur.MUST);
multiTermQuery.add(new TermQuery(new Term("field_name2", "field value 2")), BooleanClause.Occur.MUST);
Delete the Documents that Match the Query
Then we finally pass the query to the writer to delete documents that match the query:
See my answer here, related to this answer.
See the answer to this question.
// Remove the document by using a multi key query:
// http://www.avajava.com/tutorials/lessons/how-do-i-combine-queries-with-a-boolean-query.html
writer.deleteDocuments(multiTermQuery);
