Lucene: prefix query not working with WhitespaceAnalyzer - java

I'm experimenting a little with Lucene's various Query objects, and I'm trying to understand why a prefix query doesn't match any documents when using a WhitespaceAnalyzer for indexing. Consider the following test code:
protected String[] ids = { "1", "2" };
protected String[] unindexed = { "Netherlands", "Italy" };
protected String[] unstored = { "Amsterdam has lots of bridges",
        "Venice has lots of canals" };
protected String[] text = { "Amsterdam", "Venice" };

@Test
public void testWhitespaceAnalyzerPrefixQuery() throws IOException, ParseException {
    File indexes = new File("C:/LuceneInActionTutorial/indexes");
    FSDirectory dir = FSDirectory.open(indexes);
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_9,
            new LimitTokenCountAnalyzer(new WhitespaceAnalyzer(Version.LUCENE_4_9),
                    Integer.MAX_VALUE));
    IndexWriter writer = new IndexWriter(dir, config);
    for (int i = 0; i < ids.length; i++) {
        Document doc = new Document();
        doc.add(new StringField("id", ids[i], Store.NO));
        doc.add(new StoredField("country", unindexed[i]));
        doc.add(new TextField("contents", unstored[i], Store.NO));
        doc.add(new Field("city", text[i], TextField.TYPE_STORED));
        writer.addDocument(doc);
    }
    writer.close();

    DirectoryReader dr = DirectoryReader.open(dir);
    IndexSearcher is = new IndexSearcher(dr);
    QueryParser queryParser = new QueryParser(Version.LUCENE_4_9,
            "contents", new WhitespaceAnalyzer(Version.LUCENE_4_9));
    queryParser.setLowercaseExpandedTerms(true);
    Query q = queryParser.parse("Ven*");
    assertTrue(q.getClass().getSimpleName().contains("PrefixQuery"));
    TopDocs hits = is.search(q, 10);
    assertEquals(1, hits.totalHits);
}
If I replace the WhitespaceAnalyzer with the StandardAnalyzer, however, the test passes. I used Luke to inspect the index content, but couldn't find any differences in how Lucene stores the values during indexing. Could anybody please clarify what's going wrong?

StandardAnalyzer lowercases text when it is indexed. WhitespaceAnalyzer does not. With WhitespaceAnalyzer, the term in the index is "Venice".
The query parser will lowercase your query, though, since you have called setLowercaseExpandedTerms(true) (this is also the default; to disable it you need to explicitly set it to false). So your query becomes "ven*", which does not match "Venice".
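Two common fixes, sketched here assuming the same Lucene 4.9 setup as in the test: either keep the WhitespaceAnalyzer and stop the parser from lowercasing expanded terms, or lowercase at index time as well.

// Option 1: keep WhitespaceAnalyzer and disable lowercasing of expanded terms,
// so "Ven*" is left as-is and matches the indexed term "Venice".
QueryParser queryParser = new QueryParser(Version.LUCENE_4_9,
        "contents", new WhitespaceAnalyzer(Version.LUCENE_4_9));
queryParser.setLowercaseExpandedTerms(false);
Query q = queryParser.parse("Ven*");

// Option 2: lowercase at index time too, e.g. by indexing with StandardAnalyzer
// (or a custom analyzer that adds a LowerCaseFilter); then the lowercased "ven*"
// produced by the parser matches the lowercased indexed term "venice".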

Related

Java, Lucene : Sort search results with highest hit rate.

I am working on a Spring-MVC application in which I save user data and use Lucene to index and search it. Currently the functionality works fine. Is it possible to sort the results so that the best match comes first? I am currently saving a paragraph or more of text per index entry. Thank you.
Save code :
Directory directory = org.apache.lucene.store.FSDirectory.open(path);
IndexWriterConfig config = new IndexWriterConfig(new SimpleAnalyzer());
IndexWriter indexWriter = new IndexWriter(directory, config);
indexWriter.commit();
org.apache.lucene.document.Document doc = new org.apache.lucene.document.Document();
if (filePath != null) {
    File file = new File(filePath); // current directory
    doc.add(new TextField("path", file.getPath(), Field.Store.YES));
}
doc.add(new StringField("id", String.valueOf(objectId), Field.Store.YES));
FieldType fieldType = new FieldType(TextField.TYPE_STORED);
fieldType.setTokenized(false);
if (groupNotes != null) {
    doc.add(new Field("contents", text + "\n" + tagFileName + "\n"
            + String.valueOf(groupNotes.getNoteNumber()), fieldType));
} else {
    doc.add(new Field("contents", text + "\n" + tagFileName, fieldType));
}
Search code :
File file = new File(path.toString());
if ((file.isDirectory()) && (file.list().length > 0)) {
    if (text.contains(" ")) {
        String[] textArray = text.split(" ");
        for (String str : textArray) {
            Directory directory = FSDirectory.open(path);
            IndexReader indexReader = DirectoryReader.open(directory);
            IndexSearcher indexSearcher = new IndexSearcher(indexReader);
            Query query = new WildcardQuery(new Term("contents", "*" + str + "*"));
            TopDocs topDocs = indexSearcher.search(query, 100);
            for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                System.out.println("Score is " + scoreDoc.score);
                org.apache.lucene.document.Document document = indexSearcher.doc(scoreDoc.doc);
                objectIds.add(Integer.valueOf(document.get("id")));
            }
            indexSearcher.getIndexReader().close();
            directory.close();
        }
    }
}
Thank you.
Your question is not entirely clear to me, so the following are educated guesses.
There are methods on IndexSearcher that take an org.apache.lucene.search.Sort as an argument:
public TopFieldDocs search(Query query, int n, Sort sort, boolean doDocScores, boolean doMaxScore) throws IOException
public TopFieldDocs search(Query query, int n, Sort sort) throws IOException
See if these methods solve your issue.
If you simply want to sort by score, then don't collect only the document IDs; collect the score too, in a POJO that has a score field. Collect all of these POJOs in a List, then sort the list by score outside the loop:
for (ScoreDoc hit : hits) {
    // additional code
    pojo.setScore(hit.score);
    list.add(pojo);
}
Then, outside the for loop:
list.sort((POJO p1, POJO p2) -> p2.getScore().compareTo(p1.getScore()));
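Putting the two pieces together, a minimal sketch of that idea (ScoredResult is a hypothetical stand-in for the POJO, and it assumes the same topDocs and indexSearcher as in the question's search code):

// Hypothetical holder that keeps the Lucene score next to the id.
class ScoredResult {
    final int objectId;
    final float score;
    ScoredResult(int objectId, float score) {
        this.objectId = objectId;
        this.score = score;
    }
}

List<ScoredResult> results = new ArrayList<>();
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    Document document = indexSearcher.doc(scoreDoc.doc);
    results.add(new ScoredResult(Integer.valueOf(document.get("id")), scoreDoc.score));
}
// Outside the loop: highest score (best match) first.
results.sort((a, b) -> Float.compare(b.score, a.score));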

Why is my case-insensitive Lucene keyword analyzer not working?

I am trying to index documents for case insensitive search using KeywordTokenizer.
I have created a custom Analyzer that is supposed to do keyword tokenisation as well as convert all keywords to lowercase:
public class LowercasingKeywordAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        KeywordTokenizer keywordTokenizer = new KeywordTokenizer();
        return new TokenStreamComponents(keywordTokenizer, new LowerCaseFilter(keywordTokenizer));
    }
}
Why does the search return no results when I submit a TermQuery with all criteria terms lowercased? Here is a unit test reproducing the issue:
@Test
public void experiment() throws IOException, ParseException {
    Analyzer analyzer = new LowercasingKeywordAnalyzer();
    Directory directory = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    IndexWriter iwriter = new IndexWriter(directory, config);
    Document doc = new Document();
    String text = "This is the text to be indexed.";
    doc.add(new StringField("fieldname", text, Store.NO));
    iwriter.addDocument(doc);
    iwriter.close();

    // Now search the index:
    DirectoryReader ireader = DirectoryReader.open(directory);
    IndexSearcher isearcher = new IndexSearcher(ireader);
    // THE TEST PASSES WITH THE CASE-SENSITIVE QUERY TERM, BUT DOES NOT PASS WITH LOWERCASED
    //Query query = new TermQuery(new Term("fieldname", "This is the text to be indexed."));
    Query query = new TermQuery(new Term("fieldname", "This is the text to be indexed.".toLowerCase()));
    ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
    assertEquals(1, hits.length);
    ireader.close();
    directory.close();
}
Please help me identify what is wrong here.
NOTE: I am aware of Lucene QueryParsers as well as deprecation of some interfaces, please do not bother commenting on this.
StringField is not analyzed. No analyzer you define will affect it. You can use a TextField instead, or a Field where you can define your own FieldType. Or just lowercase it before constructing the field and continue to use StringField.
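A minimal sketch of those options against the test above, assuming the same LowercasingKeywordAnalyzer is still passed to the IndexWriterConfig:

// Option 1: use an analyzed field, so KeywordTokenizer + LowerCaseFilter run at
// index time and the whole value is indexed as one lowercased term; the
// lowercased TermQuery in the test then matches.
doc.add(new TextField("fieldname", text, Store.NO));

// Option 2: keep StringField (never analyzed) and lowercase the value yourself
// before indexing.
doc.add(new StringField("fieldname", text.toLowerCase(), Store.NO));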

Prefix search using lucene

I am trying to implement autocomplete using Lucene's search functionality. I have the following code, which searches by the query prefix, but it also returns all sentences containing that word, while I want it to display only sentences or words starting exactly with that prefix.
ex: m
--holiday mansion houseboat
--eye muscles
--movies of all time
--machine
I want it to show only the last two results. How can I do this? I'm stuck here, and I'm also new to Lucene. Can anyone please help me with this? Thanks in advance.
void addDoc(IndexWriter w, String title, String isbn) throws IOException {
    Document doc = new Document();
    doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
    // use a string field for isbn because we don't want it tokenized
    doc.add(new Field("isbn", isbn, Field.Store.YES, Field.Index.ANALYZED));
    w.addDocument(doc);
}
Main:
try {
    // 0. Specify the analyzer for tokenizing text.
    //    The same analyzer should be used for indexing and searching.
    StandardAnalyzer analyzer = new StandardAnalyzer();

    // 1. Create the index.
    Directory index = FSDirectory.open(new File(indexDir));
    IndexWriter writer = new IndexWriter(index, new StandardAnalyzer(Version.LUCENE_30),
            true, IndexWriter.MaxFieldLength.UNLIMITED);
    for (int i = 0; i < source.size(); i++) {
        addDoc(writer, source.get(i), (i + 1) + "z");
    }
    writer.close();

    // 2. Build the query.
    Term term = new Term("title", querystr);
    // create the prefix query object
    PrefixQuery query = new PrefixQuery(term);

    // 3. Search.
    int hitsPerPage = 20;
    IndexReader reader = IndexReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
    searcher.search(query, collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;

    // 4. Get the results.
    for (int i = 0; i < hits.length; ++i) {
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);
        System.out.println(d.get("title"));
    }
    reader.close();
} catch (Exception e) {
    System.out.println("Exception (LuceneAlgo.getSimilarString()) : " + e);
}
}
}
I see two solutions:
1. As suggested by Yahnoosh, save the title field twice: once as a TextField (analyzed) and once as a StringField (not analyzed).
2. Save it just as a TextField, but when querying use SpanFirstQuery:
// 2. query
Term term = new Term("title", querystr);
// create the prefix query and restrict matches to the start of the field
PrefixQuery pq = new PrefixQuery(term);
SpanQuery wrapper = new SpanMultiTermQueryWrapper<PrefixQuery>(pq);
// note: "final" is a reserved word in Java, so the variable needs another name
Query firstPositionQuery = new SpanFirstQuery(wrapper, 1);
If I understand your scenario correctly, you want to autocomplete on the title field.
The solution is to have two fields: one analyzed, to enable querying over it, one non-analyzed to have titles indexed without breaking them into individual terms.
Your autocomplete logic should issue prefix queries against the non-analyzed field to match only on the first word. Your term queries should be issued against the analyzed field for matches within the title.
I hope that makes sense.
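A short sketch of that two-field layout, using the 4.x field classes mentioned above (the field name "title_exact" is made up for illustration):

// Indexing: the analyzed copy supports per-term matching inside the title,
// the un-analyzed copy keeps the whole title as a single term.
doc.add(new TextField("title", title, Field.Store.YES));
doc.add(new StringField("title_exact", title, Field.Store.NO));

// Autocomplete: a prefix query against the un-analyzed copy only matches titles
// that start with the typed text, e.g. "m" -> "machine", "movies of all time".
Query autocomplete = new PrefixQuery(new Term("title_exact", querystr));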

What analyzer should I use so that I get hits for misspelled words?

I am writing full-text search functionality in my project using Lucene 4.3.
Everything works just fine when I add data, but when querying I only get hits if at least one word in the query matches at least one word in the value of a field in the index.
For example, using the following code:
private static StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);

public static void addCustomerDoc(Map<String, String[]> parameters, String path, long customerId) throws IOException {
    File file = new File(path + "/index/");
    FSDirectory indexDir = FSDirectory.open(file);
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_43, analyzer);
    IndexWriter writer = new IndexWriter(indexDir, config);
    Document doc = new Document();
    doc.add(new TextField("email", parameters.get("email")[0].toString(), Field.Store.YES));
    doc.add(new TextField("username", parameters.get("username")[0].toString(), Field.Store.YES));
    doc.add(new TextField("phone", parameters.get("phone")[0].toString(), Field.Store.YES));
    doc.add(new StringField("customerId", "" + customerId, Field.Store.YES));
    addDoc(writer, doc);
    writer.close();
}

private static void addDoc(IndexWriter writer, Document doc) throws IOException {
    writer.addDocument(doc);
    writer.commit();
}
and adding a user like
username = foobar
email = foobar@example.com
phone = 0723123456
if I search for foo, fooba or foobarx I get no hits. Shouldn't I get a result even if I type only f, or type past the end of the word foobar?
If you are looking for query parser syntax, you should look into wildcard and fuzzy query syntax.
You can search for a prefix with syntax like:
username:foob*
And you can use a fuzzy query instead, with:
username:foobarx~
Or you can limit how loose the fuzzy matching is with a number between 0 and 1, higher being more restrictive, like:
username:foobarx~0.5
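If you build the queries in code rather than through the query parser, the programmatic equivalents look roughly like this (a sketch against the "username" field from the question):

// Prefix match: usernames starting with "foob".
Query prefixQuery = new PrefixQuery(new Term("username", "foob"));

// Fuzzy match: tolerates a small edit distance, so "foobarx" can still hit "foobar".
Query fuzzyQuery = new FuzzyQuery(new Term("username", "foobarx"));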

Missing hits on lucene index search

I index one big database overview (just text fields) on which the user must be able to search (see the indexFields method below). This search used to be done in the database with an ILIKE query, but that was slow, so now the search runs against the index. However, when I compare the results of the database query with the results of the index search, the index search always returns far fewer results.
I'm not sure whether I'm making a mistake in the indexing or in the search process. It all seems to make sense to me. Any ideas?
Here is the code. All advices appreciated!
// INDEXING
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT, stopSet); // stop set is empty
IndexWriter writer = new IndexWriter(INDEX_DIR, analyzer, true,
        IndexWriter.MaxFieldLength.UNLIMITED);
indexFields(writer);
writer.optimize();
writer.commit();
writer.close();
analyzer.close();
private void indexFields(IndexWriter writer) {
    DetachedCriteria criteria = DetachedCriteria.forClass(Activit.class);
    int count = 0;
    int max = 50000;
    boolean existMoreToIndex = true;
    List<Activit> result = new ArrayList<Activit>();
    while (existMoreToIndex) {
        try {
            result = activitService.listPaged(count, max);
            if (result.size() < max)
                existMoreToIndex = false;
            if (result.size() == 0)
                return;
            for (Activit ao : result) {
                Document doc = new Document();
                doc.add(new Field("id", String.valueOf(ao.getId()),
                        Field.Store.YES, Field.Index.ANALYZED));
                if (ao.getActivityOwner() != null)
                    doc.add(new Field("field1", ao.getActivityOwner(),
                            Field.Store.YES, Field.Index.ANALYZED));
                if (ao.getActivityResponsible() != null)
                    doc.add(new Field("field2", ao.getActivityResponsible(),
                            Field.Store.YES, Field.Index.ANALYZED));
                try {
                    writer.addDocument(doc);
                } catch (CorruptIndexException e) {
                    e.printStackTrace();
                }
            }
            count += max;
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
// SEARCH
public List<Activit> searchActivitiesInIndex(String searchCriteria) {
    Set<String> stopSet = new HashSet<String>(); // empty because we do not want to remove stop words
    Version version = Version.LUCENE_CURRENT;
    String[] fields = { "field1", "field2" };
    try {
        File tempFile = new File("C://testindex");
        Directory INDEX_DIR = new SimpleFSDirectory(tempFile);
        Searcher searcher = new IndexSearcher(INDEX_DIR, true);
        QueryParser parser = new MultiFieldQueryParser(version, fields,
                new StandardAnalyzer(version, stopSet));
        Query query = parser.parse(searchCriteria);
        TopDocs topDocs = searcher.search(query, 500);
        ScoreDoc[] hits = topDocs.scoreDocs;
        // here I always get a smaller hits length
        searcher.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Most likely the analyzer is doing something that you aren't expecting.
Open your index using Luke: you can see what your (analyzed) indexed documents look like, as well as your parsed queries, which should show you what's going wrong.
Also, can you give an example of searchCriteria and the corresponding SQL query? Without that, it's hard to tell whether the indexing is done correctly. You may also not need MultiFieldQueryParser, which is quite inefficient.
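Besides Luke, you can also print what the analyzer actually produces for a sample value. A rough sketch, assuming a Lucene version where CharTermAttribute is available (3.1 or later) and using the same StandardAnalyzer setup as the indexing code:

// Prints the tokens StandardAnalyzer emits for a sample field value, so you can
// compare them with the terms your ILIKE query would have matched.
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT, stopSet);
TokenStream ts = analyzer.tokenStream("field1", new StringReader("some sample value"));
CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
ts.reset();
while (ts.incrementToken()) {
    System.out.println(term.toString());
}
ts.end();
ts.close();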
