Java, Lucene : Sort search results with highest hit rate.

Java, Lucene : Sort search results with highest hit rate. - java

I am working on a Spring-MVC application in which I am saving contents of user-data and using Lucene to index and search. Currently the functionality is working fine. Is it possible to sort the result with the highest matching probability first? I am currently saving paragraphs or more of text in indexes. Thank you.
Save code :
Directory directory = org.apache.lucene.store.FSDirectory.open(path);
IndexWriterConfig config = new IndexWriterConfig(new SimpleAnalyzer());
IndexWriter indexWriter = new IndexWriter(directory, config);
indexWriter.commit();
org.apache.lucene.document.Document doc = new org.apache.lucene.document.Document();
if (filePath != null) {
File file = new File(filePath); // current directory
doc.add(new TextField("path", file.getPath(), Field.Store.YES));
}
doc.add(new StringField("id", String.valueOf(objectId), Field.Store.YES));
FieldType fieldType = new FieldType(TextField.TYPE_STORED);
fieldType.setTokenized(false);
if(groupNotes!=null) {
doc.add(new Field("contents", text + "\n" + tagFileName+"\n"+String.valueOf(groupNotes.getNoteNumber()), fieldType));
}else {
doc.add(new Field("contents", text + "\n" + tagFileName, fieldType));
}
Search code :
File file = new File(path.toString());
if ((file.isDirectory()) && (file.list().length > 0)) {
if(text.contains(" ")) {
String[] textArray = text.split(" ");
for(String str : textArray) {
Directory directory = FSDirectory.open(path);
IndexReader indexReader = DirectoryReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
Query query = new WildcardQuery(new Term("contents","*"+str + "*"));
TopDocs topDocs = indexSearcher.search(query, 100);
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
System.out.println("Score is "+scoreDoc.score);
org.apache.lucene.document.Document document = indexSearcher.doc(scoreDoc.doc);
objectIds.add(Integer.valueOf(document.get("id")));
}
indexSearcher.getIndexReader().close();
directory.close();
}
}
}
}
Thank you.

Your question is not a bit very clear to me so below are just guessed answers ,
There are methods in IndexSearcher which take org.apache.lucene.search.Sort as argument ,
public TopFieldDocs search(Query query, int n,
Sort sort, boolean doDocScores, boolean doMaxScore) throws IOException OR
public TopFieldDocs search(Query query, int n, Sort sort) throws IOException
See if these methods solve your issue.
If you simply want to sort on the basis of scores then don't collect only document Ids but collect score too in a pojo that has that score field .
Collect all these pojos in some List then outside loop sort list on the basis
of score.
for (ScoreDoc hit : hits) {
//additional code
pojo.setScore(hit.score);
list.add(pojo);
}
then outside for loop ,
list.sort((POJO p1, POJO p2) -> p2
.getScore().compareTo(p1.getScore()));

Related

Prefix search using lucene

I am trying to do autocomplete using lucene search functionality. I have the following code which searches by the query prefix but along with that it also gives me all the sentences containing that word while I want it to display only sentence or word starting exactly with that prefix.
ex: m
--holiday mansion houseboat
--eye muscles
--movies of all time
--machine
I want it to show only last 2 queries. How to do it am stucked here also I am new to lucene. Please can any one help me in this. Thanks in advance.
addDoc(IndexWriter w, String title, String isbn) throws IOException {
Document doc = new Document();
doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
// use a string field for isbn because we don't want it tokenized
doc.add(new Field("isbn", isbn, Field.Store.YES, Field.Index.ANALYZED));
w.addDocument(doc);
}
Main:
try {
// 0. Specify the analyzer for tokenizing text.
// The same analyzer should be used for indexing and searching
StandardAnalyzer analyzer = new StandardAnalyzer();
// 1. create the index
Directory index = FSDirectory.open(new File(indexDir));
IndexWriter writer = new IndexWriter(index, new StandardAnalyzer(Version.LUCENE_30), true, IndexWriter.MaxFieldLength.UNLIMITED); //3
for (int i = 0; i < source.size(); i++) {
addDoc(writer, source.get(i), + (i + 1) + "z");
}
writer.close();
// 2. query
Term term = new Term("title", querystr);
//create the term query object
PrefixQuery query = new PrefixQuery(term);
// 3. search
int hitsPerPage = 20;
IndexReader reader = IndexReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// 4. Get results
for (int i = 0; i < hits.length; ++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println(d.get("title"));
}
reader.close();
} catch (Exception e) {
System.out.println("Exception (LuceneAlgo.getSimilarString()) : " + e);
}
}
}

I see two solutions:
as suggested by Yahnoosh, save the title field twice, Once as TextField (=analyzed) and once as StringField (not analyzed)
save it just as TextField, but When Querying use SpanFirstQuery
// 2. query
Term term = new Term("title", querystr);
//create the term query object
PrefixQuery pq = new PrefixQuery(term);
SpanQuery wrapper = new SpanMultiTermQueryWrapper<PrefixQuery>(pq);
Query final = new SpanFirstQuery(wrapper, 1);

If I understand your scenario correctly, you want to autocomplete on the title field.
The solution is to have two fields: one analyzed, to enable querying over it, one non-analyzed to have titles indexed without breaking them into individual terms.
Your autocomplete logic should issue prefix queries against the non-analyzed field to match only on the first word. Your term queries should be issued against the analyzed field for matches within the title.
I hope that makes sense.

Lucene: prefix query not working with WhitespaceAnalyzer

I'm experimenting a little with Lucene's diverse Query objects and I'm trying to understand why a prefix query doesn't match any documents when using a WhitespaceAnaylzer for indexing. Consider the following test code:
protected String[] ids = { "1", "2" };
protected String[] unindexed = { "Netherlands", "Italy" };
protected String[] unstored = { "Amsterdam has lots of bridges",
"Venice has lots of canals" };
protected String[] text = { "Amsterdam", "Venice" };
#Test
public void testWhitespaceAnalyzerPrefixQuery() throws IOException, ParseException {
File indexes = new File(
"C:/LuceneInActionTutorial/indexes");
FSDirectory dir = FSDirectory.open(indexes);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_9,
new LimitTokenCountAnalyzer(new WhitespaceAnalyzer(
Version.LUCENE_4_9), Integer.MAX_VALUE));
IndexWriter writer = new IndexWriter(dir, config);
for (int i = 0; i < ids.length; i++) {
Document doc = new Document();
doc.add(new StringField("id", ids[i], Store.NO));
doc.add(new StoredField("country", unindexed[i]));
doc.add(new TextField("contents", unstored[i], Store.NO));
doc.add(new Field("city", text[i], TextField.TYPE_STORED));
writer.addDocument(doc);
}
writer.close();
DirectoryReader dr = DirectoryReader.open(dir);
IndexSearcher is = new IndexSearcher(dr);
QueryParser queryParser = new QueryParser(Version.LUCENE_4_9,
"contents", new WhitespaceAnalyzer(Version.LUCENE_4_9));
queryParser.setLowercaseExpandedTerms(true);
Query q = queryParser.parse("Ven*");
assertTrue(q.getClass().getSimpleName().contains("PrefixQuery"));
TopDocs hits = is.search(q, 10);
assertEquals(1, hits.totalHits);
}
If I replace the WhitespaceAnaylzer with the StandardAnalyzer the test passes though. I used Luke to inspect the index content, but couldn't find any differences in how Lucene stores the values during indexing. Could anybody please clarify what's going wrong?

StandardAnalyzer lowercases text when it is indexed. WhitespaceAnalyzer does not. The term in the index, with WhitespaceAnalyzer is "Venice".
The query parser will lowercase your query though, since you have set setLowercaseExpandedTerms(true) (this is also the default, to disable this you need to explicitly set it to false). So your query is "ven*", which does not match "Venice".

How will I go about indexing a customer using Lucene

I have a web application which stores customers usernames, emails and phone numbers.
I want customers to search for other users using email, phone or username for a start just to understand the whole lucene concept. then later on i will add functionality to search within a user an item he posts. I am following this example on www.lucenetutorial.com/lucene-in-5-minutes.html
public class HelloLucene {
public static void main(String[] args) throws IOException, ParseException {
// 0. Specify the analyzer for tokenizing text.
// The same analyzer should be used for indexing and searching
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
// 1. create the index
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
IndexWriter w = new IndexWriter(index, config);
addDoc(w, "Lucene in Action", "193398817");
addDoc(w, "Lucene for Dummies", "55320055Z");
addDoc(w, "Managing Gigabytes", "55063554A");
addDoc(w, "The Art of Computer Science", "9900333X");
w.close();
// 2. query
String querystr = args.length > 0 ? args[0] : "lucene";
// the "title" arg specifies the default field to use
// when no field is explicitly specified in the query.
Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr);
// 3. search
int hitsPerPage = 10;
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// 4. display results
System.out.println("Found " + hits.length + " hits.");
for(int i=0;i<hits.length;++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println((i + 1) + ". " + d.get("isbn") + "\t" + d.get("title"));
}
// reader can only be closed when there
// is no need to access the documents any more.
reader.close();
}
private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
Document doc = new Document();
doc.add(new TextField("title", title, Field.Store.YES));
// use a string field for isbn because we don't want it tokenized
doc.add(new StringField("isbn", isbn, Field.Store.YES));
w.addDocument(doc);
}
}
I want new customers to be added to index automatically on registration. customerId is timestamp. so should i add a new document for each field on the customers details or should i concatenate all fields into a string and add as a single document? Please go easy on me I am really new.

This is a good place to start with Lucene indexing mechanism
http://www.ibm.com/developerworks/library/wa-lucene/
In the bottom line when lucene index the document, it first converts it into lucene document form. This lucene document comprises of set of fields and each field is a set of terms. Term are nothing but stream of bytes.
The document which is to be index to pass to analyzer which forms these terms out of it, and these terms keywords which are match during searching process.
When we perform a search process the query is analyzed through the same analyzer and then is match against the terms.
So you dont have to create a document for each field, rather you should create a single document for each user.

Lucene 4.0 in text search

I'm using lucene 4.0 with java. I'm trying to search for a string inside a string. If we look at the lucene hello world example, I wish to find the text "lucene" inside the phrase "inLuceneAction". I want it to find me two matches in this case instead of one.
Any Idea on how to do it?
Thanks
public class HelloLucene {
public static void main(String[] args) throws IOException, ParseException {
// 0. Specify the analyzer for tokenizing text.
// The same analyzer should be used for indexing and searching
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
// 1. create the index
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
IndexWriter w = new IndexWriter(index, config);
addDoc(w, "inLuceneAction", "193398817");
addDoc(w, "Lucene for Dummies", "55320055Z");
addDoc(w, "Managing Gigabytes", "55063554A");
addDoc(w, "The Art of Computer Science", "9900333X");
w.close();
// 2. query
String querystr = args.length > 0 ? args[0] : "lucene";
// the "title" arg specifies the default field to use
// when no field is explicitly specified in the query.
Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr);
// 3. search
int hitsPerPage = 10;
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// 4. display results
System.out.println("Found " + hits.length + " hits.");
for(int i=0;i<hits.length;++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println((i + 1) + ". " + d.get("isbn") + "\t" + d.get("title"));
}
// reader can only be closed when there
// is no need to access the documents any more.
reader.close();
}
private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
Document doc = new Document();
doc.add(new TextField("title", title, Field.Store.YES));
// use a string field for isbn because we don't want it tokenized
doc.add(new StringField("isbn", isbn, Field.Store.YES));
w.addDocument(doc);
}
}

If you index the terms in the default way, meaning inLuceneAction is one term, Lucene won't be able to seek to this term given Lucene because it has a different prefix. Analyze this string so that it results in three indexed terms: in Lucene Action and then you'll have it fetched. You'll either find a ready-made analyzer for this or you'll have to write your own. Writing own analyzers is a bit out of scope for a single StackOverflow answer, but an excellent place to start is the package info at the bottom of the org.apache.lucene.analysis package Javadoc page.

Missing hits on lucene index search

i index one big database overview (just text fields) on which the user must be able to search (below in indexFields method). This search before was done in the database with ILIKE query, but was slow, so now search is done on index. Hovewer, when i compare search results from db query, and results i get with the index search, there is always much less results with search from index.
Im not sure if i am making mistake in indexing or in search process. To me all seems to make sense here. Any ideas?
Here is the code. All advices appreciated!
// INDEXING
StandardAnalyzer analyzer = new StandardAnalyzer(
Version.LUCENE_CURRENT, stopSet); // stop set is empty
IndexWriter writer = new IndexWriter(INDEX_DIR, analyzer, true,
IndexWriter.MaxFieldLength.UNLIMITED);
indexFields(writer);
writer.optimize();
writer.commit();
writer.close();
analyzer.close();
private void indexFields(IndexWriter writer) {
DetachedCriteria criteria = DetachedCriteria
.forClass(Activit.class);
int count = 0;
int max = 50000;
boolean existMoreToIndex = true;
List<Activit> result = new ArrayList<Activit>();
while (existMoreToIndex) {
try {
result = activitService.listPaged(count, max);
if (result.size() < max)
existMoreToIndex = false;
if (result.size() == 0)
return;
for (Activit ao : result) {
Document doc = new Document();
doc.add(new Field("id", String.valueOf(ao.getId()),
Field.Store.YES, Field.Index.ANALYZED));
if(ao.getActivitOwner()!=null)
doc.add(new Field("field1", ao.getActivityOwner(),Field.Store.YES, Field.Index.ANALYZED));
if(ao.getActivitResponsible() != null)
doc.add(new Field("field2", ao.getActivityResponsible(), Field.Store.YES,Field.Index.ANALYZED));
try {
writer.addDocument(doc);
} catch (CorruptIndexException e) {
e.printStackTrace();
}
count += max;
//SEARCH
public List<Activit> searchActivitiesInIndex(String searchCriteria) {
Set<String> stopSet = new HashSet<String>(); // empty because we do not want to remove stop words
Version version = Version.LUCENE_CURRENT;
String[] fields = {
"field1", "field2"};
try {
File tempFile = new File("C://testindex");
Directory INDEX_DIR = new SimpleFSDirectory(tempFile);
Searcher searcher = new IndexSearcher(INDEX_DIR, true);
QueryParser parser = new MultiFieldQueryParser(version, fields, new StandardAnalyzer(
version, stopSet));
Query query = parser.parse(searchCriteria);
TopDocs topDocs = searcher.search(query, 500);
ScoreDoc[] hits = topDocs.scoreDocs;
//here i always get smaller hits lenght
searcher.close();
} catch (Exception e) {
e.printStackTrace();
}
}

Most likely the analyzer is doing something that you aren't expecting.
Open your index using Luke, you can see what your (analyzed) indexed documents look like, as well as your parsed queries - should let you see what's going wrong.
Also, can you give an example of searchCriteria? And the corresponding SQL query? Without that, it's hard to know if the indexing is done correctly. You may also not need to use MultiFieldQueryParser, which is quite inefficient.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Java, Lucene : Sort search results with highest hit rate. - java

Related

Prefix search using lucene

Lucene: prefix query not working with WhitespaceAnalyzer

How will I go about indexing a customer using Lucene

Lucene 4.0 in text search

Missing hits on lucene index search

Categories

Resources