Lucene not working when search with multiples words phrase [duplicate]

I want to search with a string containing many words and retrieve the documents that match any of them. My indexing method is the following:
Document document = new Document();
document.add(new TextField("termos", text, Field.Store.YES));
document.add(new TextField("docNumber",fileNumber,Field.Store.YES));
Analyzer analyzer = CustomAnalyzer.builder()
.withTokenizer("standard")
.addTokenFilter("lowercase")
.addTokenFilter("stop")
.addTokenFilter("porterstem")
.addTokenFilter("capitalization")
.build();
config = new IndexWriterConfig(analyzer);
writer = new IndexWriter(indexDirectory, config);
writer.addDocument(document);
writer.commit();
And here is my search method. I don't want to look for the specific phrase, but for any of the words in it. The analyzer used for searching is the same one used for indexing.
Query query = new QueryBuilder(analyzer).createPhraseQuery("termos","THE_PHRASE");
String indexDir = rootProjectFolder + "/indexDir/";
IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(indexDir)));
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(1000,1000);
searcher.search(query,collector);
I'm new to Lucene. Can someone help me?

Using createPhraseQuery("termos", "list of words") will try to match exactly the phrase "list of words", with a phrase slop of 0.
If you want to match any term in a list of words, you can use createBooleanQuery:
new QueryBuilder(analyzer).createBooleanQuery("termos", terms, BooleanClause.Occur.SHOULD);
As an alternative, you can also use createMinShouldMatchQuery() to require a fraction of the query terms to match, e.g. to match at least 10 percent of the terms:
new QueryBuilder(analyzer).createMinShouldMatchQuery("termos", terms, 0.1f);
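To make the difference between the three query styles concrete, here is a minimal plain-Java sketch. It is a toy model and uses no Lucene classes: the class MatchModes, its methods, and the whitespace tokenizer are all hypothetical stand-ins, and the rounding rule in minShouldMatch is an assumption of this sketch rather than a statement about Lucene's internals.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: plain Java, no Lucene classes involved.
class MatchModes {
    // Stand-in for analysis: lowercase and split on whitespace.
    static List<String> tokens(String text) {
        return Arrays.asList(text.toLowerCase().trim().split("\\s+"));
    }

    // Phrase with slop 0: query tokens must appear adjacent and in order.
    static boolean phraseMatch(String doc, String query) {
        return Collections.indexOfSubList(tokens(doc), tokens(query)) >= 0;
    }

    // Number of distinct query tokens that occur anywhere in the document.
    static int matchedTerms(String doc, String query) {
        Set<String> docTerms = new HashSet<>(tokens(doc));
        Set<String> queryTerms = new HashSet<>(tokens(query));
        queryTerms.retainAll(docTerms);
        return queryTerms.size();
    }

    // SHOULD on every term: one matching term is enough.
    static boolean anyMatch(String doc, String query) {
        return matchedTerms(doc, query) >= 1;
    }

    // Fraction-based minimum; rounding down (but never below 1) is this sketch's choice.
    static boolean minShouldMatch(String doc, String query, float fraction) {
        int distinctQueryTerms = new HashSet<>(tokens(query)).size();
        int required = Math.max(1, (int) (fraction * distinctQueryTerms));
        return matchedTerms(doc, query) >= required;
    }

    public static void main(String[] args) {
        String doc = "quick brown fox";
        System.out.println(phraseMatch(doc, "fox quick"));                   // false
        System.out.println(anyMatch(doc, "fox quick"));                      // true
        System.out.println(minShouldMatch(doc, "quick cat dog bird", 0.5f)); // false
    }
}
```

The phrase check fails as soon as the words are out of order or not adjacent, which is why a long query string passed to createPhraseQuery rarely matches anything.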

Related

Lucene is not returning the results if I am searching with special characters

I am using Lucene 6.6.0, and I am indexing my data with StandardAnalyzer.
I am indexing the following values:
a&e networks
a&e
After indexing, when I search for a&e, no results are returned.
This is my sample code:
Directory dir = new RAMDirectory();
IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
IndexWriter writer = new IndexWriter(dir, iwc);
Document doc = new Document();
doc.add(new TextField("text", "a&e networks", Field.Store.YES));
writer.addDocument(doc);
doc = new Document();
doc.add(new TextField("text", "a&e", Field.Store.YES));
writer.addDocument(doc);
writer.close();
IndexReader reader = DirectoryReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);
Query query = new TermQuery(new Term("text", "a&e"));
TopDocs results = searcher.search(query, 5);
final ScoreDoc[] scoreDocs = results.scoreDocs;
for (ScoreDoc scoreDoc : scoreDocs) {
System.out.println(scoreDoc.doc + " " + scoreDoc.score + " " + searcher.doc(scoreDoc.doc).get("text"));
}
System.out.println("Hits: " + results.totalHits);
System.out.println("Max score:" + results.getMaxScore());
The output I get is:
Hits: 0
Max score:NaN
Even searching for a returns no results in this case.
But if I pass a stop-word set to StandardAnalyzer like this:
List<String> stopWords = Arrays.asList("&");
CharArraySet stopSet = new CharArraySet(stopWords, false);
IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer(stopSet));
After that, if I search for a, I do get results. But even in that case, if I search for a&e, I still get no results.
Please suggest how to achieve this. My goal is that searching for a&e returns results. Do I need a CustomAnalyzer? If so, please explain what I should add to it.

The & character is probably treated as a word boundary:
https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/analysis/standard/StandardTokenizer.html
This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
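So a&e is most likely split into the tokens a and e, and a is in StandardAnalyzer's default English stop-word set (e is not), so it is removed at index time. A TermQuery is not analyzed, so the literal term a&e never matches anything in the index.
You can try some randomly generated keywords separated by the & character (e.g. adsadaerewfds&eqeqwedasd). After indexing, try to search for the keywords before and after the &. If those keywords are found, either store the values without analyzing them (you can use StringField) or create a custom analyzer.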
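The effect described above can be illustrated with a small plain-Java sketch. It is only a rough imitation: the class ToyAnalysis, the regular expression, and the three-word stop list are hypothetical stand-ins, and the real StandardTokenizer follows the Unicode segmentation rules rather than this simple split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: imitates "split on punctuation, then drop stop words".
class ToyAnalysis {
    // Hypothetical stand-in stop list; not Lucene's actual default set.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "an", "the"));

    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> terms = new ArrayList<>();
        // Anything that is not a letter or digit acts as a boundary, including '&'.
        for (String token : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("a&e networks", STOP_WORDS));            // [e, networks]
        System.out.println(analyze("a&e", Collections.<String>emptySet())); // [a, e]
    }
}
```

In neither case does the literal string a&e survive as a term, so an unanalyzed lookup for it finds nothing; with an empty stop list the term a is kept, which matches the behaviour reported in the question.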

Lucene migration text field difference in 3.0.3 and 5

I have a problem with the migration of a Lucene field from version 3.0.3 to 5.x . I prepared two JUnit test programs (one with 3.0.3 and other with 5.x) to compare the behavior.
Lucene 3:
analyzer = new StandardAnalyzer(Version.LUCENE_30);
indexWriter = new IndexWriter(dir, analyzer, true, MaxFieldLength.UNLIMITED);
....
Document doc = new Document();
doc.add(new Field("keyword", "another test#foo-bar", Field.Store.YES,
Field.Index.ANALYZED));
indexWriter.addDocument(doc);
indexWriter.commit();
....
indexReader = IndexReader.open(FSDirectory.open(path.toFile()), false);
searcher = new IndexSearcher(indexReader);
QueryParser parser = new QueryParser(Version.LUCENE_30, "keyword", analyzer);
Query query = parser.parse("test");
searcher.search(query, searcher.maxDoc());
TopDocs topDocs = searcher.search(query, searcher.maxDoc());
ScoreDoc[] hits = topDocs.scoreDocs;
doc = indexReader.document(hits[0].doc);
// doc is now NULL <- EXPECTED
assertNull(result);
The equivalent test with Lucene 5.x (only the changed lines are shown):
analyzer = new StandardAnalyzer();
IndexWriterConfig indexConfig = new IndexWriterConfig(analyzer)
.setCommitOnClose(true).setOpenMode(openMode);
// create the index writer
indexWriter = new IndexWriter(dir, indexConfig);
...
// line old style (Lucene 3)
doc.add(new Field("keyword", "another test#foo-bar", Field.Store.YES,
Field.Index.ANALYZED));
// or with new field types (enable only one line)
doc.add(new TextField("keyword", "another test#foo-bar", Field.Store.YES));
...
Query query = new QueryParser(field, analyzer).parse(field + ":"
+ value);
doc = indexReader.document(hits[0].doc);
// returns a document each time
assertNull(doc); // fails!
I used the migration guide https://lucene.apache.org/core/4_8_0/MIGRATE.html to replace the Field class with the TextField class, but the search now behaves differently.
Question: How can I get the same result with Lucene 5.x as I did with Lucene 3?
The Lucene 3 analyzer seems to split the input string on spaces only. The Lucene 5 version of the analyzer seems to split on spaces, '#' and '-'. :/

WildcardQuery Lucene does not work properly

I am trying to use WildcardQuery:
IndexSearcher indexSearcher = new IndexSearcher(ireader);
Term term = new Term("phrase", QueryParser.escape(partOfPhrase) + "*");
WildcardQuery wildcardQuery = new WildcardQuery(term);
LOG.debug(partOfPhrase);
Sort sort = new Sort(new SortField("freq", SortField.Type.LONG,true));
ScoreDoc[] hits = indexSearcher.search(wildcardQuery, null, 10, sort).scoreDocs;
But when I enter "san " (without quotes), I want to get something like "san diego", "san antonio", etc. Instead, I get not only these results but also "sandals" (there must be a space after san) and "juelz santana" (I want to find only phrases that start with san). How can I fix this issue?
EDIT
Also, if I enter "san d", I get no results.

One possible way to solve this problem is to use another analyzer, one that does not split the query and the document text on spaces.
One such analyzer is KeywordAnalyzer, which treats the whole field value as a single keyword.
Essential part of the test:
Directory dir = new RAMDirectory();
Analyzer analyzer = new KeywordAnalyzer();
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
IndexWriter writer = new IndexWriter(dir, iwc);
Later on, I can add the needed docs:
Document doc = new Document();
doc.add(new TextField("text", "san diego", Field.Store.YES));
writer.addDocument(doc);
And finally, search as you want:
IndexReader reader = DirectoryReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);
Term term = new Term("text", QueryParser.escape("san ") + "*");
WildcardQuery wildcardQuery = new WildcardQuery(term);
My test works properly, retrieving san diego and san antonio but not sandals. Take a look at the full test here: https://github.com/MysterionRise/information-retrieval-adventure/blob/master/src/main/java/org/mystic/lucene/WildcardQueryWithSpace.java
For more information about the analyzer itself: http://lucene.apache.org/core/4_10_2/analyzers-common/org/apache/lucene/analysis/core/KeywordAnalyzer.html
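As a rough illustration of why a single-token field changes the outcome, the following plain-Java sketch compares a prefix test against whole field values with a prefix test against individual words. It is a toy model with no Lucene classes; PrefixDemo and its two methods are hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: contrasts whole-value prefixes with per-word prefixes.
class PrefixDemo {
    // One token per field value (the KeywordAnalyzer idea): test the whole value.
    static List<String> prefixOnWholeValue(List<String> values, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String value : values) {
            if (value.startsWith(prefix)) {
                hits.add(value);
            }
        }
        return hits;
    }

    // Whitespace-split tokens: test each word separately, so no word contains a space.
    static List<String> prefixOnWords(List<String> values, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String value : values) {
            for (String word : value.split("\\s+")) {
                if (word.startsWith(prefix)) {
                    hits.add(value);
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("san diego", "san antonio", "sandals", "juelz santana");
        System.out.println(prefixOnWords(values, "san"));       // all four values
        System.out.println(prefixOnWords(values, "san d"));     // []
        System.out.println(prefixOnWholeValue(values, "san ")); // [san diego, san antonio]
    }
}
```

The empty result for "san d" in the per-word case mirrors the EDIT in the question: once the text is split on spaces, no single term can contain one.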

Lucene: Multi-word phrases as search terms

I'm trying to make a searchable phone/local business directory using Apache Lucene.
I have fields for street name, business name, phone number etc. The problem that I'm having is that when I try to search by street where the street name has multiple words (e.g. 'the crescent'), no results are returned. But if I try to search with just one word, e.g 'crescent', I get all the results that I want.
I'm indexing the data with the following:
String LocationOfDirectory = "C:\\dir\\index";
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);
Directory index = new SimpleFSDirectory(new File(LocationOfDirectory));
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_34, analyzer);
IndexWriter w = new IndexWriter(index, config);
Document doc = new Document();
doc.add(new Field("Street", "the crescent", Field.Store.YES, Field.Index.ANALYZED));
w.addDocument(doc);
w.close();
My searches work like this:
int numberOfHits = 200;
String LocationOfDirectory = "C:\\dir\\index";
TopScoreDocCollector collector = TopScoreDocCollector.create(numberOfHits, true);
Directory directory = new SimpleFSDirectory(new File(LocationOfDirectory));
IndexSearcher searcher = new IndexSearcher(IndexReader.open(directory));
WildcardQuery q = new WildcardQuery(new Term("Street", "the crescent"));
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
I have tried swapping the wildcard query for a phrase query, first with the entire string and then splitting the string up on white space and wrapping them in a BooleanQuery like this:
String term = "the crescent";
BooleanQuery b = new BooleanQuery();
PhraseQuery p = new PhraseQuery();
String[] tokens = term.split(" ");
for(int i = 0 ; i < tokens.length ; ++i)
{
p.add(new Term("Street", tokens[i]));
}
b.add(p, BooleanClause.Occur.MUST);
However, this didn't work. I tried using a KeywordAnalyzer instead of a StandardAnalyzer, but then all other types of search stopped working as well. I have tried replacing spaces with other characters (+ and #), and converting queries to and from this form, but that still doesn't work. I think it doesn't work because + and # are special characters which are not indexed, but I can't seem to find a list anywhere of which characters are like that.
I'm beginning to go slightly mad, does anyone know what I'm doing wrong?

The reason why you don't get your documents back is that while indexing you're using StandardAnalyzer, which converts tokens to lowercase and removes stop words. So the only term that gets indexed for your example is 'crescent'. However, wildcard queries are not analyzed, so 'the' is included as a mandatory part of the query. The same goes for phrase queries in your scenario.
KeywordAnalyzer is probably not very suitable for your use case, because it takes the whole field content as a single token. You can use SimpleAnalyzer for the street field -- it will split the input on all non-letter characters and then convert the tokens to lowercase. You can also consider using WhitespaceAnalyzer with LowerCaseFilter. You need to try different options and work out what works best for your data and users.
Also, you can use different analyzers per field (e.g. with PerFieldAnalyzerWrapper) if changing analyzer for that field breaks other searches.
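The mismatch between an analyzed index and an unanalyzed query can be sketched in plain Java. This is a toy model with no Lucene classes: QueryAnalysisDemo, its methods, and the three-word stop list are hypothetical, and "all query terms must be present" stands in for the mandatory terms of the hand-built query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: shows why a raw query term that was dropped at index time cannot match.
class QueryAnalysisDemo {
    // Hypothetical stand-in stop list; not Lucene's actual default set.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "an", "the"));

    // Stand-in for analysis: lowercase, split on whitespace, drop stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase().trim().split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Every query term is treated as mandatory.
    static boolean allTermsPresent(Set<String> indexedTerms, List<String> queryTerms) {
        return indexedTerms.containsAll(queryTerms);
    }

    public static void main(String[] args) {
        Set<String> indexed = new HashSet<>(analyze("the crescent")); // only "crescent" is kept
        List<String> rawQuery = Arrays.asList("the crescent".split(" "));
        System.out.println(allTermsPresent(indexed, rawQuery));                // false
        System.out.println(allTermsPresent(indexed, analyze("the crescent"))); // true
    }
}
```

Running the query text through the same analysis as the indexed text is what makes the second check succeed.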

I found that my attempt to generate a query without using a QueryParser was not working, so I stopped trying to create my own queries and used a QueryParser instead. All of the recommendations that I saw online said that you should use the same Analyzer in the QueryParser that you use during indexing, so I used a StandardAnalyzer to build the QueryParser.
This works for this example because the StandardAnalyzer removes the word "the" from the street "the crescent" during indexing, so it isn't in the index and can't be searched for; parsing the query with the same analyzer removes it from the query as well.
However, if we choose to search for "Grove Road", we have a problem with the out-of-the-box functionality, namely that the query will return all of the results containing either "Grove" OR "Road". This is easily fixed by setting up the QueryParser so that its default operator is AND instead of OR.
In the end, the correct solution was the following:
int numberOfHits = 200;
String LocationOfDirectory = "C:\\dir\\index";
TopScoreDocCollector collector = TopScoreDocCollector.create(numberOfHits, true);
Directory directory = new SimpleFSDirectory(new File(LocationOfDirectory));
IndexSearcher searcher = new IndexSearcher(IndexReader.open(directory));
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
//WildcardQuery q = new WildcardQuery(new Term("Street", "the crescent"));
QueryParser qp = new QueryParser(Version.LUCENE_35, "Street", analyzer);
qp.setDefaultOperator(QueryParser.Operator.AND);
Query q = qp.parse("grove road");
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;

@RikSaunderson's solution for searching documents in which all subqueries of a query have to occur still works with Lucene 9.
QueryParser queryParser = new QueryParser(LuceneConstants.CONTENTS, new StandardAnalyzer());
queryParser.setDefaultOperator(QueryParser.Operator.AND);

If you want the street to match as exact words, you can index the "Street" field as NOT_ANALYZED, so that the stop word "the" is not filtered out.
doc.add(new Field("Street", "the crescent", Field.Store.YES, Field.Index.NOT_ANALYZED));

There is no need to use any Analyzer here, because Hibernate implicitly uses StandardAnalyzer, which splits the text into words on white space. The solution is to set analyze to Analyze.NO; multi-word phrase search then works automatically.
@Column(name="skill")
@Field(index=Index.YES, analyze=Analyze.NO, store=Store.NO)
@Analyzer(definition="SkillsAnalyzer")
private String skill;
