Prefix search using lucene - java

I am trying to do autocomplete using lucene search functionality. I have the following code which searches by the query prefix but along with that it also gives me all the sentences containing that word while I want it to display only sentence or word starting exactly with that prefix.
ex: m
--holiday mansion houseboat
--eye muscles
--movies of all time
--machine
I want it to show only last 2 queries. How to do it am stucked here also I am new to lucene. Please can any one help me in this. Thanks in advance.
addDoc(IndexWriter w, String title, String isbn) throws IOException {
Document doc = new Document();
doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
// use a string field for isbn because we don't want it tokenized
doc.add(new Field("isbn", isbn, Field.Store.YES, Field.Index.ANALYZED));
w.addDocument(doc);
}
Main:
try {
// 0. Specify the analyzer for tokenizing text.
// The same analyzer should be used for indexing and searching
StandardAnalyzer analyzer = new StandardAnalyzer();
// 1. create the index
Directory index = FSDirectory.open(new File(indexDir));
IndexWriter writer = new IndexWriter(index, new StandardAnalyzer(Version.LUCENE_30), true, IndexWriter.MaxFieldLength.UNLIMITED); //3
for (int i = 0; i < source.size(); i++) {
addDoc(writer, source.get(i), + (i + 1) + "z");
}
writer.close();
// 2. query
Term term = new Term("title", querystr);
//create the term query object
PrefixQuery query = new PrefixQuery(term);
// 3. search
int hitsPerPage = 20;
IndexReader reader = IndexReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// 4. Get results
for (int i = 0; i < hits.length; ++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println(d.get("title"));
}
reader.close();
} catch (Exception e) {
System.out.println("Exception (LuceneAlgo.getSimilarString()) : " + e);
}
}
}

I see two solutions:
as suggested by Yahnoosh, save the title field twice, Once as TextField (=analyzed) and once as StringField (not analyzed)
save it just as TextField, but When Querying use SpanFirstQuery
// 2. query
Term term = new Term("title", querystr);
//create the term query object
PrefixQuery pq = new PrefixQuery(term);
SpanQuery wrapper = new SpanMultiTermQueryWrapper<PrefixQuery>(pq);
Query final = new SpanFirstQuery(wrapper, 1);

If I understand your scenario correctly, you want to autocomplete on the title field.
The solution is to have two fields: one analyzed, to enable querying over it, one non-analyzed to have titles indexed without breaking them into individual terms.
Your autocomplete logic should issue prefix queries against the non-analyzed field to match only on the first word. Your term queries should be issued against the analyzed field for matches within the title.
I hope that makes sense.

Related

Lucene: prefix query not working with WhitespaceAnalyzer

I'm experimenting a little with Lucene's diverse Query objects and I'm trying to understand why a prefix query doesn't match any documents when using a WhitespaceAnaylzer for indexing. Consider the following test code:
protected String[] ids = { "1", "2" };
protected String[] unindexed = { "Netherlands", "Italy" };
protected String[] unstored = { "Amsterdam has lots of bridges",
"Venice has lots of canals" };
protected String[] text = { "Amsterdam", "Venice" };
#Test
public void testWhitespaceAnalyzerPrefixQuery() throws IOException, ParseException {
File indexes = new File(
"C:/LuceneInActionTutorial/indexes");
FSDirectory dir = FSDirectory.open(indexes);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_9,
new LimitTokenCountAnalyzer(new WhitespaceAnalyzer(
Version.LUCENE_4_9), Integer.MAX_VALUE));
IndexWriter writer = new IndexWriter(dir, config);
for (int i = 0; i < ids.length; i++) {
Document doc = new Document();
doc.add(new StringField("id", ids[i], Store.NO));
doc.add(new StoredField("country", unindexed[i]));
doc.add(new TextField("contents", unstored[i], Store.NO));
doc.add(new Field("city", text[i], TextField.TYPE_STORED));
writer.addDocument(doc);
}
writer.close();
DirectoryReader dr = DirectoryReader.open(dir);
IndexSearcher is = new IndexSearcher(dr);
QueryParser queryParser = new QueryParser(Version.LUCENE_4_9,
"contents", new WhitespaceAnalyzer(Version.LUCENE_4_9));
queryParser.setLowercaseExpandedTerms(true);
Query q = queryParser.parse("Ven*");
assertTrue(q.getClass().getSimpleName().contains("PrefixQuery"));
TopDocs hits = is.search(q, 10);
assertEquals(1, hits.totalHits);
}
If I replace the WhitespaceAnaylzer with the StandardAnalyzer the test passes though. I used Luke to inspect the index content, but couldn't find any differences in how Lucene stores the values during indexing. Could anybody please clarify what's going wrong?
StandardAnalyzer lowercases text when it is indexed. WhitespaceAnalyzer does not. The term in the index, with WhitespaceAnalyzer is "Venice".
The query parser will lowercase your query though, since you have set setLowercaseExpandedTerms(true) (this is also the default, to disable this you need to explicitly set it to false). So your query is "ven*", which does not match "Venice".

Fetch Searched Data/Metadata In Lucene

Hi I am java developer and learning Lucene. I have a java class that index a pdf(lucene_in_action_2nd_edition.pdf) file and a search class that search some text from index. IndexSearcher is giving Document which shows that string exists in index(lucene_in_action_2nd_edition.pdf) or not.
But now I want to get searched data or metadata. i.e. I want to know that at which page string is matched, or few text around matched string, etc... How to do that?
Here is my LuceneSearcher.java class:
public static void main(String[] args) throws Exception {
File indexDir = new File("D:\\index");
String querystr = "Advantages of FastVectorHighlighter";
Query q = new QueryParser(Version.LUCENE_40, "contents",
new StandardAnalyzer(Version.LUCENE_40)).parse(querystr);
int hitsPerPage = 100;
IndexReader reader = DirectoryReader.open(FSDirectory.open(indexDir));
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(
hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
System.out.println("Found " + hits.length + " hits.");
for (int i = 0; i < hits.length; i++) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println((i + 1) + "... " + d.get("filename"));
System.out.println("=====================================================");
System.out.println(d.get("contents"));
}
// reader can only be closed when there
// is no need to access the documents any more.
reader.close();
}
Here d.get("contents") give full text(generated by Tika) of .pdf file, that was stored at time of indexing.
I want some information about searched text, so that I can show that on my web page or highlight searched text properly(like google search output). How to achieve that? Do we need to write some logic or Lucene does it internally?
Any type of help would be appreciated. Thanks in advance.
The org.apache.lucene.search.highlight package provides this functionality.
Such as:
SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter();
Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));
for (int i = 0; i < hits.length; i++) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
String text = doc.get("contents");
String bestFrag = highlighter.getBestFragment(analyzer, "contents", text);
//output, however you like.
You can also get a list of best Fragments from the highlighter, instead of just a single one, if you prefer, see the Highlighter API

How will I go about indexing a customer using Lucene

I have a web application which stores customers usernames, emails and phone numbers.
I want customers to search for other users using email, phone or username for a start just to understand the whole lucene concept. then later on i will add functionality to search within a user an item he posts. I am following this example on www.lucenetutorial.com/lucene-in-5-minutes.html
public class HelloLucene {
public static void main(String[] args) throws IOException, ParseException {
// 0. Specify the analyzer for tokenizing text.
// The same analyzer should be used for indexing and searching
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
// 1. create the index
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
IndexWriter w = new IndexWriter(index, config);
addDoc(w, "Lucene in Action", "193398817");
addDoc(w, "Lucene for Dummies", "55320055Z");
addDoc(w, "Managing Gigabytes", "55063554A");
addDoc(w, "The Art of Computer Science", "9900333X");
w.close();
// 2. query
String querystr = args.length > 0 ? args[0] : "lucene";
// the "title" arg specifies the default field to use
// when no field is explicitly specified in the query.
Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr);
// 3. search
int hitsPerPage = 10;
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// 4. display results
System.out.println("Found " + hits.length + " hits.");
for(int i=0;i<hits.length;++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println((i + 1) + ". " + d.get("isbn") + "\t" + d.get("title"));
}
// reader can only be closed when there
// is no need to access the documents any more.
reader.close();
}
private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
Document doc = new Document();
doc.add(new TextField("title", title, Field.Store.YES));
// use a string field for isbn because we don't want it tokenized
doc.add(new StringField("isbn", isbn, Field.Store.YES));
w.addDocument(doc);
}
}
I want new customers to be added to index automatically on registration. customerId is timestamp. so should i add a new document for each field on the customers details or should i concatenate all fields into a string and add as a single document? Please go easy on me I am really new.
This is a good place to start with Lucene indexing mechanism
http://www.ibm.com/developerworks/library/wa-lucene/
In the bottom line when lucene index the document, it first converts it into lucene document form. This lucene document comprises of set of fields and each field is a set of terms. Term are nothing but stream of bytes.
The document which is to be index to pass to analyzer which forms these terms out of it, and these terms keywords which are match during searching process.
When we perform a search process the query is analyzed through the same analyzer and then is match against the terms.
So you dont have to create a document for each field, rather you should create a single document for each user.

Lucene 4.0 in text search

I'm using lucene 4.0 with java. I'm trying to search for a string inside a string. If we look at the lucene hello world example, I wish to find the text "lucene" inside the phrase "inLuceneAction". I want it to find me two matches in this case instead of one.
Any Idea on how to do it?
Thanks
public class HelloLucene {
public static void main(String[] args) throws IOException, ParseException {
// 0. Specify the analyzer for tokenizing text.
// The same analyzer should be used for indexing and searching
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
// 1. create the index
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
IndexWriter w = new IndexWriter(index, config);
addDoc(w, "inLuceneAction", "193398817");
addDoc(w, "Lucene for Dummies", "55320055Z");
addDoc(w, "Managing Gigabytes", "55063554A");
addDoc(w, "The Art of Computer Science", "9900333X");
w.close();
// 2. query
String querystr = args.length > 0 ? args[0] : "lucene";
// the "title" arg specifies the default field to use
// when no field is explicitly specified in the query.
Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr);
// 3. search
int hitsPerPage = 10;
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// 4. display results
System.out.println("Found " + hits.length + " hits.");
for(int i=0;i<hits.length;++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println((i + 1) + ". " + d.get("isbn") + "\t" + d.get("title"));
}
// reader can only be closed when there
// is no need to access the documents any more.
reader.close();
}
private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
Document doc = new Document();
doc.add(new TextField("title", title, Field.Store.YES));
// use a string field for isbn because we don't want it tokenized
doc.add(new StringField("isbn", isbn, Field.Store.YES));
w.addDocument(doc);
}
}
If you index the terms in the default way, meaning inLuceneAction is one term, Lucene won't be able to seek to this term given Lucene because it has a different prefix. Analyze this string so that it results in three indexed terms: in Lucene Action and then you'll have it fetched. You'll either find a ready-made analyzer for this or you'll have to write your own. Writing own analyzers is a bit out of scope for a single StackOverflow answer, but an excellent place to start is the package info at the bottom of the org.apache.lucene.analysis package Javadoc page.

Missing hits on lucene index search

i index one big database overview (just text fields) on which the user must be able to search (below in indexFields method). This search before was done in the database with ILIKE query, but was slow, so now search is done on index. Hovewer, when i compare search results from db query, and results i get with the index search, there is always much less results with search from index.
Im not sure if i am making mistake in indexing or in search process. To me all seems to make sense here. Any ideas?
Here is the code. All advices appreciated!
// INDEXING
StandardAnalyzer analyzer = new StandardAnalyzer(
Version.LUCENE_CURRENT, stopSet); // stop set is empty
IndexWriter writer = new IndexWriter(INDEX_DIR, analyzer, true,
IndexWriter.MaxFieldLength.UNLIMITED);
indexFields(writer);
writer.optimize();
writer.commit();
writer.close();
analyzer.close();
private void indexFields(IndexWriter writer) {
DetachedCriteria criteria = DetachedCriteria
.forClass(Activit.class);
int count = 0;
int max = 50000;
boolean existMoreToIndex = true;
List<Activit> result = new ArrayList<Activit>();
while (existMoreToIndex) {
try {
result = activitService.listPaged(count, max);
if (result.size() < max)
existMoreToIndex = false;
if (result.size() == 0)
return;
for (Activit ao : result) {
Document doc = new Document();
doc.add(new Field("id", String.valueOf(ao.getId()),
Field.Store.YES, Field.Index.ANALYZED));
if(ao.getActivitOwner()!=null)
doc.add(new Field("field1", ao.getActivityOwner(),Field.Store.YES, Field.Index.ANALYZED));
if(ao.getActivitResponsible() != null)
doc.add(new Field("field2", ao.getActivityResponsible(), Field.Store.YES,Field.Index.ANALYZED));
try {
writer.addDocument(doc);
} catch (CorruptIndexException e) {
e.printStackTrace();
}
count += max;
//SEARCH
public List<Activit> searchActivitiesInIndex(String searchCriteria) {
Set<String> stopSet = new HashSet<String>(); // empty because we do not want to remove stop words
Version version = Version.LUCENE_CURRENT;
String[] fields = {
"field1", "field2"};
try {
File tempFile = new File("C://testindex");
Directory INDEX_DIR = new SimpleFSDirectory(tempFile);
Searcher searcher = new IndexSearcher(INDEX_DIR, true);
QueryParser parser = new MultiFieldQueryParser(version, fields, new StandardAnalyzer(
version, stopSet));
Query query = parser.parse(searchCriteria);
TopDocs topDocs = searcher.search(query, 500);
ScoreDoc[] hits = topDocs.scoreDocs;
//here i always get smaller hits lenght
searcher.close();
} catch (Exception e) {
e.printStackTrace();
}
}
Most likely the analyzer is doing something that you aren't expecting.
Open your index using Luke, you can see what your (analyzed) indexed documents look like, as well as your parsed queries - should let you see what's going wrong.
Also, can you give an example of searchCriteria? And the corresponding SQL query? Without that, it's hard to know if the indexing is done correctly. You may also not need to use MultiFieldQueryParser, which is quite inefficient.

Categories