Lucene: Multi-word phrases as search terms - java

I'm trying to make a searchable phone/local business directory using Apache Lucene.
I have fields for street name, business name, phone number etc. The problem that I'm having is that when I try to search by street where the street name has multiple words (e.g. 'the crescent'), no results are returned. But if I try to search with just one word, e.g 'crescent', I get all the results that I want.
I'm indexing the data with the following:
String LocationOfDirectory = "C:\\dir\\index";
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);
Directory Index = new SimpleFSDirectory(LocationOfDirectory);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE.34, analyzer);
IndexWriter w = new IndexWriter(index, config);
Document doc = new Document();
doc.add(new Field("Street", "the crescent", Field.Store.YES, Field.Index.Analyzed);
w.add(doc);
w.close();
My searches work like this:
int numberOfHits = 200;
String LocationOfDirectory = "C:\\dir\\index";
TopScoreDocCollector collector = TopScoreDocCollector.create(numberOfHits, true);
Directory directory = new SimpleFSDirectory(new File(LocationOfDirectory));
IndexSearcher searcher = new IndexSearcher(IndexReader.open(directory);
WildcardQuery q = new WildcardQuery(new Term("Street", "the crescent");
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
I have tried swapping the wildcard query for a phrase query, first with the entire string and then splitting the string up on white space and wrapping them in a BooleanQuery like this:
String term = "the crescent";
BooleanQuery b = new BooleanQuery();
PhraseQuery p = new PhraseQuery();
String[] tokens = term.split(" ");
for(int i = 0 ; i < tokens.length ; ++i)
{
p.add(new Term("Street", tokens[i]));
}
b.add(p, BooleanClause.Occur.MUST);
However, this didn't work. I tried using a KeywordAnalyzer instead of a StandardAnalyzer, but then all other types of search stopped working as well. I have tried replacing spaces with other characters (+ and #), and converting queries to and from this form, but that still doesn't work. I think it doesn't work because + and # are special characters which are not indexed, but I can't seem to find a list anywhere of which characters are like that.
I'm beginning to go slightly mad, does anyone know what I'm doing wrong?

The reason why you don't get your documents back is that while indexing you're using StandardAnalyzer, which converts tokens to lowercase and removes stop words. So the only term that gets indexed for your example is 'crescent'. However, wildcard queries are not analyzed, so 'the' is included as mandatory part of the query. The same goes for phrase queries in your scenario.
KeywordAnalyzer is probably not very suitable for your use case, because it takes whole field content as a single token. You can use SimpleAnalyzer for the street field -- it will split the input on all non-letter characters and then convert them to lowercase. You can also consider using WhitespaceAnalyzer with LowerCaseFilter. You need to try different options and work out what works best for your data and users.
Also, you can use different analyzers per field (e.g. with PerFieldAnalyzerWrapper) if changing analyzer for that field breaks other searches.

I found that my attempt to generate a query without using a QueryParser was not working, so I stopped trying to create my own queries and used a QueryParser instead. All of the recomendations that I saw online showed that you should use the same Analyzer in the QueryParser that you use during indexing, so I used a StandardAnalyzer to build the QueryParser.
This works on this example because the StandardAnalyzer removes the word "the" from the street "the crescent" during indexing, and hence we can't search for it because it isn't in the index.
However, if we choose to search for "Grove Road", we have a problem with the out-of-the-box functionality, namely that the query will return all of the results containing either "Grove" OR "Road". This is easily fixed by setting up the QueryParser so that it's default operation is AND instead of OR.
In the end, the correct solution was the following:
int numberOfHits = 200;
String LocationOfDirectory = "C:\\dir\\index";
TopScoreDocCollector collector = TopScoreDocCollector.create(numberOfHits, true);
Directory directory = new SimpleFSDirectory(new File(LocationOfDirectory));
IndexSearcher searcher = new IndexSearcher(IndexReader.open(directory);
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
//WildcardQuery q = new WildcardQuery(new Term("Street", "the crescent");
QueryParser qp = new QueryParser(Version.LUCENE_35, "Street", analyzer);
qp.setDefaultOperator(QueryParser.Operator.AND);
Query q = qp.parse("grove road");
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;

#RikSaunderson's solution for searching documents where all subqueries of a query have to occur, is still working with Lucene 9.
QueryParser queryParser = new QueryParser(LuceneConstants.CONTENTS, new StandardAnalyzer());
queryParser.setDefaultOperator(QueryParser.Operator.AND);

If you want an exact words match the street, you could set Field "Street" NOT_ANALYZED which will not filter stop word "the".
doc.add(new Field("Street", "the crescent", Field.Store.YES, Field.Index.Not_Analyzed);

There is no need of using any Analyzer here coz Hibernate implicitly uses StandardAnalyzer which will split the words based on white spaces so the solution here is set the Analyze to NO it will automatically performs Multi Phrase Search
#Column(name="skill")
#Field(index=Index.YES, analyze=Analyze.NO, store=Store.NO)
#Analyzer(definition="SkillsAnalyzer")
private String skill;

Related

Lucene is not returning the results if I am searching with special characters

I am using Lucene 6.6.0 version, and I am indexing my data using StandardAnalyzer.
I am indexing following data of words.
a&e networks
a&e
After indexing , when I am searching with a&e it is not returning any results.
this is my sample code.
Directory dir = new RAMDirectory();
IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
IndexWriter writer = new IndexWriter(dir, iwc);
Document doc = new Document();
doc.add(new TextField("text", "a&e networks", Field.Store.YES));
writer.addDocument(doc);
doc = new Document();
doc.add(new TextField("text", "a&e", Field.Store.YES));
writer.addDocument(doc);
writer.close();
IndexReader reader = DirectoryReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);
Query query = new TermQuery(new Term("text", "a&e"));
TopDocs results = searcher.search(query, 5);
final ScoreDoc[] scoreDocs = results.scoreDocs;
for (ScoreDoc scoreDoc : scoreDocs) {
System.out.println(scoreDoc.doc + " " + scoreDoc.score + " " + searcher.doc(scoreDoc.doc).get("text"));
}
System.out.println("Hits: " + results.totalHits);
System.out.println("Max score:" + results.getMaxScore());
I am getting output as
Hits: 0
Max score:NaN
Even I am searching for a also it is not giving any results in this case.
but if I add stopwords set to StandardAnalyzer like this
List<String> stopWords = Arrays.asList("&");
CharArraySet stopSet = new CharArraySet(stopWords, false);
IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer(stopSet));
and after that if i search for a then I am able to get the results. but even in that case also if i search for a&e , then I am not getting any results.
please suggest me how to achieve this, my goal here is if I search for a&e I should be able to get the results. do I need to any CustomAnalyzer ? If so please explain what should I add in CustomAnalyzer?
Probably & character is considered as a word boundary:
https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/analysis/standard/StandardTokenizer.html
This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
a and e are probably considered as stop word. So when indexed they are removed.
You can try some randomly generated keywords seperated by & character (eg. adsadaerewfds&eqeqwedasd). After indexing try to search keywords before and after &. If those keywords are found either store them without analyzing (you can use StringField) or create custom analyzer.

Lucene not working when search with multiples words phrase [duplicate]

i wanna search a string with lots of words, and retrieves documents that matches with any of them. My indexing method is the folowing:
Document document = new Document();
document.add(new TextField("termos", text, Field.Store.YES));
document.add(new TextField("docNumber",fileNumber,Field.Store.YES));
config = new IndexWriterConfig(analyzer);
Analyzer analyzer = CustomAnalyzer.builder()
.withTokenizer("standard")
.addTokenFilter("lowercase")
.addTokenFilter("stop")
.addTokenFilter("porterstem")
.addTokenFilter("capitalization")
.build();
config = IndexWriterConfig(analyzer);
writer = new IndexWriter(indexDirectory, config);
writer.addDocument(document);
writer.commit();
And here is my search method. I dont wanna look for specific phrase, but any of word in that. The analyzer for search is the same that for index.
Query query = new QueryBuilder(analyzer).createPhraseQuery("termos","THE_PHRASE");
String indexDir = rootProjectFolder + "/indexDir/";
IndexReader reader = DirectoryReader.open(indexDir);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(1000,1000);
searcher.search(query,collector);
Im new on Lucene. Someone can help me?
Using createPhraseQuery("termos", "list of words") will precisely try to match the phrase "list of words" with a phrase slop of 0.
If you want to match any term in a list of words, you can use createBooleanQuery :
new QueryBuilder(analyzer).createBooleanQuery("termos", terms, BooleanClause.Occur.SHOULD);
As an alternative, you can also use createMinShouldMatchQuery() so that you can require a fraction of the number of query terms to match, eg. to match at least 10 percent of the terms :
new QueryBuilder(analyzer).createMinShouldMatchQuery("termos", terms, 0.1f));

Lucene search match any word at phrase

i wanna search a string with lots of words, and retrieves documents that matches with any of them. My indexing method is the folowing:
Document document = new Document();
document.add(new TextField("termos", text, Field.Store.YES));
document.add(new TextField("docNumber",fileNumber,Field.Store.YES));
config = new IndexWriterConfig(analyzer);
Analyzer analyzer = CustomAnalyzer.builder()
.withTokenizer("standard")
.addTokenFilter("lowercase")
.addTokenFilter("stop")
.addTokenFilter("porterstem")
.addTokenFilter("capitalization")
.build();
config = IndexWriterConfig(analyzer);
writer = new IndexWriter(indexDirectory, config);
writer.addDocument(document);
writer.commit();
And here is my search method. I dont wanna look for specific phrase, but any of word in that. The analyzer for search is the same that for index.
Query query = new QueryBuilder(analyzer).createPhraseQuery("termos","THE_PHRASE");
String indexDir = rootProjectFolder + "/indexDir/";
IndexReader reader = DirectoryReader.open(indexDir);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(1000,1000);
searcher.search(query,collector);
Im new on Lucene. Someone can help me?
Using createPhraseQuery("termos", "list of words") will precisely try to match the phrase "list of words" with a phrase slop of 0.
If you want to match any term in a list of words, you can use createBooleanQuery :
new QueryBuilder(analyzer).createBooleanQuery("termos", terms, BooleanClause.Occur.SHOULD);
As an alternative, you can also use createMinShouldMatchQuery() so that you can require a fraction of the number of query terms to match, eg. to match at least 10 percent of the terms :
new QueryBuilder(analyzer).createMinShouldMatchQuery("termos", terms, 0.1f));

PhraseQuery is not working in Apache lucene 7.2.1

I am new to the Apache Lucene. I am using the Apache Lucene v7.2.1.
I need to do a phrase search in a huge file. I first made a sample code to figure out phrase search functionality in the Lucene using PhraseQuery. But it does not work.
My code is given below:
public class LuceneExample
{
private static final String INDEX_DIR = "myIndexDir";
// function to create index writer
private static IndexWriter createWriter() throws IOException
{
FSDirectory dir = FSDirectory.open(Paths.get(INDEX_DIR));
IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
IndexWriter writer = new IndexWriter(dir, config);
return writer;
}
// function to create the index document.
private static Document createDocument(Integer id, String source, String target)
{
Document document = new Document();
document.add(new StringField("id", id.toString() , Store.YES));
document.add(new TextField("source", source , Store.YES));
document.add(new TextField("target", target , Store.YES));
return document;
}
// function to do index search by source
private static TopDocs searchBySource(String source, IndexSearcher searcher) throws Exception
{
// phrase query build
PhraseQuery.Builder builder = new PhraseQuery.Builder();
String[] words = source.split(" ");
int ii = 0;
for (String word : words) {
builder.add(new Term("source", word), ii);
ii = ii + 1;
}
PhraseQuery query = builder.build();
System.out.println(query);
// phrase search
TopDocs hits = searcher.search(query, 10);
return hits;
}
public static void main(String[] args) throws Exception
{
// TODO Auto-generated method stub
// create index writer
IndexWriter writer = createWriter();
//create documents object
List<Document> documents = new ArrayList<>();
String src = "Negotiation Skills are focused on resolving differences for the benefit of an individual or a group , or to satisfy various interests.";
String tgt = "Modified target : Negotiation Skills are focused on resolving differences for the benefit of an individual or a group, or to satisfy various interests.";
Document d1 = createDocument(1, src, tgt);
documents.add(d1);
src = "This point may benefit all of the participating entities, or just a single party, some of them, or all of them.";
tgt = "Modified target : This point may benefit all of the participating entities, or just a single party, some of them, or all of them.";
Document d2 = createDocument(2, src, tgt);
documents.add(d2);
writer.deleteAll();
// adding documents to index writer
writer.addDocuments(documents);
writer.commit();
writer.close();
// for index searching
Directory dir = FSDirectory.open(Paths.get(INDEX_DIR));
IndexReader reader = DirectoryReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);
//Search by source
TopDocs foundDocs = searchBySource("benefit of an individual", searcher);
System.out.println("Total Results count :: " + foundDocs.totalHits);
}
}
When I searched for the string "benefit of an individual" as mentioned above. The Total Results count comes as '0' . But it is present in the document1. It would be great if I could get any help in resolving this issue.
Thanks in advance.
Let's start from the summary:
at index time you are using Standard analyzer with English stop words
at query time you are using your own analysis without stop words and special characters removal
There is a rule use the same analysis chain at index and query time.
Here is an example of a simplified and "correct" query processing:
// function to do index search by source
private static TopDocs searchBySource(String source, IndexSearcher searcher) throws Exception {
// phrase query build
PhraseQuery.Builder builder = new PhraseQuery.Builder();
TokenStream tokenStream = new StandardAnalyzer().tokenStream("source", source);
tokenStream.reset();
while (tokenStream.incrementToken()) {
CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
builder.add(new Term("source", charTermAttribute.toString()));
}
tokenStream.end();
tokenStream.close();
builder.setSlop(2);
PhraseQuery query = builder.build();
System.out.println(query);
// phrase search
TopDocs hits = searcher.search(query, 10);
return hits;
}
In sake of simplicity we can remove stop words from Standard analyzer, by using constructor with empty stop words list - and everything will be simple as you expected. You can read more about stop words and phrase queries here.
All the problems with phrase queries are started from stop words. Under the hood Lucene keeps positions of all words including stop words in a special index -
term positions. It is useful in some cases to separate "the goal" and "goal". In case of phrase query - it tries to take into account term positions. For example, we have a term "black and white" with a stop word "and". In this case Lucene index will have two terms "black" with position 1 and "white" with position 3. Naive phrase query "black white" should not match anything because it does not allow gap in terms positions. There are two possible strategies to create the right query:
"black ? white" - uses special marker for every stop word. This will match "black and white" and "black or white"
"black white"~1 - allows to match with gap in terms positions. "black or white" is also possible. With slop 2 and more "white and black" is also possible.
In order to create the right query you can use the following term attribute at query processing:
PositionIncrementAttribute positionIncrementAttribute = tokenStream.getAttribute(PositionIncrementAttribute.class);
I've used setSlop(2) in order to simplify a code snippet, you can set slop factor based on query length or put correct positions of terms in phrase builder. My recommendation is not to use stop words, you can read about stop words here.

Apache Lucene - Creating and Storing an Index?

This post if a follow-on from my previous question:
Apache Lucene - Optimizing Searching
I want to create an index from title stored in my database, store the index on the server from which I am running my web application, and have that index available to all users who are using the search feature on the web application.
I will update the index when a new title is added, edited or deleted.
I cannot find a tutorial to do this in Apache Lucene, so can anyone help me code this in Java (using Spring).
From my understanding to your question, you need to do the following :
1) Index you data (titles in your case)
first you need to implement the code that create that index for you data, check this sample of code.
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
// Store the index in memory:
//Directory directory = new RAMDirectory();
Store an index on disk
Directory directory = FSDirectory.open(indexfilesDirPathOnYourServer);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
Document doc = new Document();
String title = getTitle();
doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
iwriter.addDocument(doc);
iwriter.close();
here you need to loop over you all data.
2) Search for you indexed data.
you can search for you data by using this code:
DirectoryReader ireader = DirectoryReader.open(indexfilesDirPathOnYourServer);
IndexSearcher isearcher = new IndexSearcher(ireader);
// Parse a simple query that searches for "text":
QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "fieldname", analyzer);//note here we used the same analyzer object
Query query = parser.parse("test");//test is am example for a search query
ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
// Iterate through the results:
for (int i = 0; i < hits.length; i++) {
Document hitDoc = isearcher.doc(hits[i].doc);
System.out.println(hitDoc.get("fieldname"));
}
ireader.close();
directory.close();
Note : here you don't have to fetch all the data from your DB, you can directly get it from index. also you don't have to re-create the whole index each time user search or fetch the data, you can update the title from time to time when you add/update or delete one by one (the title that have been updated or deleted not the whole indexed titles).
to update index use :
Term keyTerm = new Term(KEY_FIELD, KEY_VALUE);
iwriter.updateDocument(keyTerm, updatedFields);
to delete index use :
Term keyTerm = new Term(KEY_FIELD, KEY_VALUE);
iwriter.deleteDocuments(keyTerm);
Hope that help you.

Categories