Lucene Search with Date parameter - java

I am fairly new to the Lucene framework. We are trying to adopt Lucene because we need to search a large amount of data within a few milliseconds.
Scenario:
We have an EmployeeDto that we have indexed in Lucene. For the example below, I have hardcoded only 6 values.
I have 2 arguments that should act as input parameters to the search query.
EmployeeDto.java
private String firstName;
private String lastName;
private Long employeeId;
private Integer salary;
private Date startDate;
private Date terminationDate;
//getters and setters
EmployeeLucene.java
public class EmployeeLucene {
public static void main(String[] args) throws IOException, ParseException {
// 0. Specify the analyzer for tokenizing text.
// The same analyzer should be used for indexing and searching
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
final DateFormat DATE_FORMAT = new SimpleDateFormat("yyyy-MM-dd");
// 1. create the index
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
IndexWriter w = new IndexWriter(index, config);
long starttimeOfLoad = Calendar.getInstance().getTimeInMillis();
System.out.println("Data Loading started");
addEmployee(w, new EmployeeDto("John", "Smith", new Long(101), 10000, DATE_FORMAT.parse("2010-05-05"), DATE_FORMAT.parse("2018-05-05")));
addEmployee(w, new EmployeeDto("Bill", "Thomas", new Long(102), 12000, DATE_FORMAT.parse("2011-06-06"), DATE_FORMAT.parse("2015-03-10")));
addEmployee(w, new EmployeeDto("Franklin", "Robinson", new Long(102), 12000, DATE_FORMAT.parse("2011-04-04"), DATE_FORMAT.parse("2015-07-07")));
addEmployee(w, new EmployeeDto("Thomas", "Boone", new Long(102), 12000, DATE_FORMAT.parse("2011-02-02"), DATE_FORMAT.parse("2015-03-10")));
addEmployee(w, new EmployeeDto("John", "Smith", new Long(103), 13000, DATE_FORMAT.parse("2019-05-05"), DATE_FORMAT.parse("2099-12-31")));
addEmployee(w, new EmployeeDto("Bill", "Thomas", new Long(102), 14000, DATE_FORMAT.parse("2011-06-06"), DATE_FORMAT.parse("2099-12-31")));
w.close();
System.out.println("Data Loaded. Completed in " + (Calendar.getInstance().getTimeInMillis() - starttimeOfLoad));
// 2. query
Query q = null;
try {
q = new QueryParser(Version.LUCENE_40, "fullName", analyzer).parse(args[0] + "*");
} catch (org.apache.lucene.queryparser.classic.ParseException e) {
e.printStackTrace();
}
// 3. search
long starttime = Calendar.getInstance().getTimeInMillis();
int hitsPerPage = 100;
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// 4. display results
System.out.println("Found " + hits.length + " hits.");
List<EmployeeDto> employeeDtoList = new ArrayList<EmployeeDto>();
for (int i = 0; i < hits.length; ++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
employeeDtoList.add(new EmployeeDto(d.get("firstName"), d.get("lastName"), Long.valueOf(d.get("employeeId")),
Integer.valueOf(d.get("salary"))));
}
System.out.println(employeeDtoList.size());
System.out.println(employeeDtoList);
System.out.println("Time taken:" + (Calendar.getInstance().getTimeInMillis() - starttime) + " ms");
}
private static void addEmployee(IndexWriter w, EmployeeDto employeeDto) throws IOException, ParseException {
Document doc = new Document();
doc.add(new TextField("fullName", employeeDto.getFirstName() + " " + employeeDto.getLastName(), Field.Store.YES));
doc.add(new TextField("firstName", employeeDto.getFirstName(), Field.Store.YES));
doc.add(new TextField("lastName", employeeDto.getLastName(), Field.Store.YES));
doc.add(new LongField("employeeId", employeeDto.getEmployeeId(), Field.Store.YES));
doc.add(new LongField("salary", employeeDto.getSalary(), Field.Store.YES));
doc.add(new LongField("startDate", employeeDto.getStartDate().getTime(), Field.Store.YES));
doc.add(new LongField("terminationDate", employeeDto.getTerminationDate().getTime(), Field.Store.YES));
w.addDocument(doc);
}
}
I run the program as "java EmployeeLucene thom 2014-05-05".
I expect only 2 results, but I get 3 hits.
Questions:
How do I include the 2nd parameter in the query? It should be greater than 'startDate' and less than 'terminationDate'.
Can we include the EmployeeDto itself inside the document, to avoid creating the list of EmployeeDtos once we get the hits?

First, you get three results because three records have a full name containing a term that matches "thom*": records 2, 4, and 6.
Second, Lucene version 4.0 is really old.
Finally, one way to query for a date between startDate and terminationDate is as follows:
// 2. query
BooleanQuery finalQuery = null;
try {
// final query
finalQuery = new BooleanQuery();
// thom* query
Query fullName = new QueryParser(Version.LUCENE_40, "fullName", analyzer).parse("thom" + "*");
finalQuery.add(fullName, Occur.MUST); // MUST implies that the keyword must occur.
// greaterStartDate query
long searchDate = DATE_FORMAT.parse("2014-05-05").getTime();
Query greaterStartDate = NumericRangeQuery.newLongRange("startDate", null, searchDate, true, true);
finalQuery.add(greaterStartDate, Occur.MUST); // Using all "MUST" occurs is equivalent to "AND" operator
// lessTerminationDate query
Query lessTerminationDate = NumericRangeQuery.newLongRange("terminationDate", searchDate, null, false, false);
finalQuery.add(lessTerminationDate, Occur.MUST);
} catch (org.apache.lucene.queryparser.classic.ParseException e) {
e.printStackTrace();
}
As for your second question ("Can we include EmployeeDto itself inside the document to avoid creation of List of EmployeeDtos once we get the hits"):
Not that I'm aware of.
EDIT: Version 7.0.1
// 0. Specify the analyzer for tokenizing text.
// The same analyzer should be used for indexing and searching
StandardAnalyzer analyzer = new StandardAnalyzer();
final DateFormat DATE_FORMAT = new SimpleDateFormat("yyyy-MM-dd");
// 1. create the index
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter w = new IndexWriter(index, config);
long starttimeOfLoad = Calendar.getInstance().getTimeInMillis();
System.out.println("Data Loading started");
addEmployee(w, new EmployeeDto("John", "Smith", new Long(101), 10000, DATE_FORMAT.parse("2010-05-05"), DATE_FORMAT.parse("2018-05-05")));
addEmployee(w, new EmployeeDto("Bill", "Thomas", new Long(102), 12000, DATE_FORMAT.parse("2011-06-06"), DATE_FORMAT.parse("2015-10-10")));
addEmployee(w, new EmployeeDto("Franklin", "Robinson", new Long(102), 12000, DATE_FORMAT.parse("2011-04-04"), DATE_FORMAT.parse("2015-07-07")));
addEmployee(w, new EmployeeDto("Thomas", "Boone", new Long(102), 12000, DATE_FORMAT.parse("2011-02-02"), DATE_FORMAT.parse("2015-03-10")));
addEmployee(w, new EmployeeDto("John", "Smith", new Long(103), 13000, DATE_FORMAT.parse("2019-05-05"), DATE_FORMAT.parse("2099-12-31")));
addEmployee(w, new EmployeeDto("Bill", "Thomas", new Long(102), 14000, DATE_FORMAT.parse("2011-06-06"), DATE_FORMAT.parse("2099-12-31")));
w.close();
System.out.println("Data Loaded. Completed in " + (Calendar.getInstance().getTimeInMillis() - starttimeOfLoad));
// 2. query
BooleanQuery finalQuery = null;
try {
// final query
Builder builder = new BooleanQuery.Builder();
// thom* query
Query fullName = new QueryParser("fullName", analyzer).parse("thom" + "*");
builder.add(fullName, Occur.MUST); // MUST implies that the keyword must occur.
// greaterStartDate query
long searchDate = DATE_FORMAT.parse("2014-05-05").getTime();
Query greaterStartDate = LongPoint.newRangeQuery("startDatePoint", Long.MIN_VALUE, searchDate);
builder.add(greaterStartDate, Occur.MUST); // Using all "MUST" occurs is equivalent to "AND" operator
// lessTerminationDate query
Query lessTerminationDate = LongPoint.newRangeQuery("terminationDatePoint", searchDate, Long.MAX_VALUE);
builder.add(lessTerminationDate, Occur.MUST);
finalQuery = builder.build();
} catch (org.apache.lucene.queryparser.classic.ParseException e) {
e.printStackTrace();
}
// 3. search
long starttime = Calendar.getInstance().getTimeInMillis();
int hitsPerPage = 100;
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage);
searcher.search(finalQuery, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// 4. display results
System.out.println("Found " + hits.length + " hits.");
List<EmployeeDto> employeeDtoList = new ArrayList<EmployeeDto>();
for (int i = 0; i < hits.length; ++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
employeeDtoList.add(new EmployeeDto(d.get("firstName"), d.get("lastName"), Long.valueOf(d.get("employeeId")),
Integer.valueOf(d.get("salary"))));
}
System.out.println(employeeDtoList.size());
System.out.println(employeeDtoList);
System.out.println("Time taken:" + (Calendar.getInstance().getTimeInMillis() - starttime) + " ms");
}
private static void addEmployee(IndexWriter w, EmployeeDto employeeDto) throws IOException {
Document doc = new Document();
doc.add(new TextField("fullName", employeeDto.getFirstName() + " " + employeeDto.getLastName(), Store.YES));
doc.add(new TextField("firstName", employeeDto.getFirstName(), Store.YES));
doc.add(new TextField("lastName", employeeDto.getLastName(), Store.YES));
doc.add(new StoredField("employeeId", employeeDto.getEmployeeId()));
doc.add(new StoredField("salary", employeeDto.getSalary()));
doc.add(new StoredField("startDate", employeeDto.getStartDate().getTime()));
doc.add(new LongPoint("startDatePoint", employeeDto.getStartDate().getTime()));
doc.add(new StoredField("terminationDate", employeeDto.getTerminationDate().getTime()));
doc.add(new LongPoint("terminationDatePoint", employeeDto.getTerminationDate().getTime()));
w.addDocument(doc);
}
EDIT: The date fields are added as both LongPoint and StoredField. A LongPoint can be used with LongPoint.newRangeQuery, but its value cannot be retrieved from the document later. A StoredField can be retrieved as a stored value but cannot be used for range queries. This example never reads the dates back, but the version 4 code supported both operations, so both fields are kept here. You can remove the StoredField dates if you never plan to retrieve the values.
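To make the range logic concrete, here is a minimal plain-Java sketch (no Lucene on the classpath) of the predicate that the name query plus the two range queries express, run against the six sample rows from the question. The class and method names are hypothetical, and dates are parsed as UTC midnights, whereas the code above uses SimpleDateFormat in the default time zone. It relies on the documented behaviour that LongPoint.newRangeQuery bounds are inclusive, so a bound is shifted by one with Math.addExact to get the strict "greater than" / "less than" the question asks for.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class DateRangeSketch {

    // Sample rows copied from the question: full name, start date, termination date.
    static final String[][] EMPLOYEES = {
        {"John Smith", "2010-05-05", "2018-05-05"},
        {"Bill Thomas", "2011-06-06", "2015-03-10"},
        {"Franklin Robinson", "2011-04-04", "2015-07-07"},
        {"Thomas Boone", "2011-02-02", "2015-03-10"},
        {"John Smith", "2019-05-05", "2099-12-31"},
        {"Bill Thomas", "2011-06-06", "2099-12-31"},
    };

    // Assumption: dates are interpreted as midnight UTC.
    static long toEpochMillis(String isoDate) {
        return LocalDate.parse(isoDate).atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
    }

    // Rough stand-in for the "thom*" query: some lowercased token starts with the prefix.
    static boolean hasTokenWithPrefix(String fullName, String prefix) {
        for (String token : fullName.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    // Same contract as an inclusive range query: lower <= value <= upper.
    static boolean inInclusiveRange(long value, long lower, long upper) {
        return value >= lower && value <= upper;
    }

    // Strict check: startDate < searchDate < terminationDate, built from inclusive ranges.
    static boolean activeOn(long start, long termination, long searchDate) {
        boolean startedBefore = inInclusiveRange(start, Long.MIN_VALUE, Math.addExact(searchDate, -1));
        boolean endsAfter = inInclusiveRange(termination, Math.addExact(searchDate, 1), Long.MAX_VALUE);
        return startedBefore && endsAfter;
    }

    // Returns the 1-based record numbers that satisfy both the name and the date condition.
    static List<Integer> matchingRecords(String prefix, String isoDate) {
        long searchDate = toEpochMillis(isoDate);
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < EMPLOYEES.length; i++) {
            String[] e = EMPLOYEES[i];
            if (hasTokenWithPrefix(e[0], prefix)
                    && activeOn(toEpochMillis(e[1]), toEpochMillis(e[2]), searchDate)) {
                result.add(i + 1);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(matchingRecords("thom", "2014-05-05")); // prints [2, 4, 6]
        System.out.println(matchingRecords("thom", "2016-01-01")); // prints [6]
    }
}
```

With this sample data all three "thom" records are active on 2014-05-05, so the date condition only narrows the result for other dates, such as 2016-01-01.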

Related

Java, Lucene : Sort search results with highest hit rate.

I am working on a Spring-MVC application in which I am saving contents of user-data and using Lucene to index and search. Currently the functionality is working fine. Is it possible to sort the result with the highest matching probability first? I am currently saving paragraphs or more of text in indexes. Thank you.
Save code :
Directory directory = org.apache.lucene.store.FSDirectory.open(path);
IndexWriterConfig config = new IndexWriterConfig(new SimpleAnalyzer());
IndexWriter indexWriter = new IndexWriter(directory, config);
indexWriter.commit();
org.apache.lucene.document.Document doc = new org.apache.lucene.document.Document();
if (filePath != null) {
File file = new File(filePath); // current directory
doc.add(new TextField("path", file.getPath(), Field.Store.YES));
}
doc.add(new StringField("id", String.valueOf(objectId), Field.Store.YES));
FieldType fieldType = new FieldType(TextField.TYPE_STORED);
fieldType.setTokenized(false);
if(groupNotes!=null) {
doc.add(new Field("contents", text + "\n" + tagFileName+"\n"+String.valueOf(groupNotes.getNoteNumber()), fieldType));
}else {
doc.add(new Field("contents", text + "\n" + tagFileName, fieldType));
}
Search code :
File file = new File(path.toString());
if ((file.isDirectory()) && (file.list().length > 0)) {
if(text.contains(" ")) {
String[] textArray = text.split(" ");
for(String str : textArray) {
Directory directory = FSDirectory.open(path);
IndexReader indexReader = DirectoryReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
Query query = new WildcardQuery(new Term("contents","*"+str + "*"));
TopDocs topDocs = indexSearcher.search(query, 100);
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
System.out.println("Score is "+scoreDoc.score);
org.apache.lucene.document.Document document = indexSearcher.doc(scoreDoc.doc);
objectIds.add(Integer.valueOf(document.get("id")));
}
indexSearcher.getIndexReader().close();
directory.close();
}
}
}
}
Thank you.
Your question is not entirely clear to me, so the following are only guesses.
IndexSearcher has methods that take an org.apache.lucene.search.Sort as an argument:
public TopFieldDocs search(Query query, int n,
Sort sort, boolean doDocScores, boolean doMaxScore) throws IOException
or
public TopFieldDocs search(Query query, int n, Sort sort) throws IOException
See whether these methods solve your issue.
If you simply want to sort by score, don't collect only the document ids; collect the score too, in a POJO that has a score field.
Collect all these POJOs in a List, then sort the list by score outside the loop.
for (ScoreDoc hit : hits) {
//additional code
pojo.setScore(hit.score);
list.add(pojo);
}
Then, outside the for loop:
list.sort((POJO p1, POJO p2) -> p2.getScore().compareTo(p1.getScore()));
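The collect-then-sort idea can be sketched in plain Java as follows. ScoredHit and rankByTotalScore are hypothetical names, and summing the scores when the same id is found by several per-word searches is an assumption of this sketch, not something the question specifies.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ScoreSortSketch {

    /** Hypothetical holder for one hit: the stored "id" value plus the score of that hit. */
    static final class ScoredHit {
        final int objectId;
        final float score;

        ScoredHit(int objectId, float score) {
            this.objectId = objectId;
            this.score = score;
        }
    }

    /** Sums the scores per object id, then returns the ids ordered by total score, highest first. */
    static List<Integer> rankByTotalScore(List<ScoredHit> hits) {
        Map<Integer, Float> totals = new LinkedHashMap<>();
        for (ScoredHit hit : hits) {
            totals.merge(hit.objectId, hit.score, Float::sum);
        }
        List<Map.Entry<Integer, Float>> entries = new ArrayList<>(totals.entrySet());
        // Descending order: compare b against a.
        entries.sort((a, b) -> Float.compare(b.getValue(), a.getValue()));
        List<Integer> ranked = new ArrayList<>();
        for (Map.Entry<Integer, Float> entry : entries) {
            ranked.add(entry.getKey());
        }
        return ranked;
    }

    public static void main(String[] args) {
        // Made-up ids and scores, as if collected from two per-word searches.
        List<ScoredHit> hits = new ArrayList<>();
        hits.add(new ScoredHit(7, 0.4f));
        hits.add(new ScoredHit(3, 1.2f));
        hits.add(new ScoredHit(7, 1.0f));
        hits.add(new ScoredHit(5, 0.9f));
        System.out.println(rankByTotalScore(hits)); // prints [7, 3, 5]
    }
}
```

In the search loop from the question, each ScoredHit would be built from document.get("id") and scoreDoc.score instead of the made-up values used here.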

Lucene4 search not working

I am new to Lucene and am using Lucene 4. I am trying to create an index for a huge RDBMS table and search the Lucene index instead of querying the table directly. I gathered bits and pieces from different sites and tried them out, and indexing "seems" to be working. The following files are created in the index directory: _uu.fdt, _uu.fdx, _uu.fnm, _uu.si, segments.gen, segments_rs.
I tried to retrieve a record from the stored index, but it did not work: the search returns zero hits.
Code snippet for creating index:
ResultSet rs = stmt.executeQuery("SELECT product_id, product_name, brand_id, brand_name, price, screen_type, size_category, usage_category FROM mobile_product_master WHERE product_id like 'No0%'");
Directory storeIndexDirectory = FSDirectory.open(new File("E:\\index_dir"));
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
while(rs.next())
{
productId = rs.getString("product_id");
productName = rs.getString("product_name");
brandId = rs.getString("brand_id");
brandName = rs.getString("brand_name");
price = rs.getString("price");
screenType = rs.getString("screen_type");
sizeCategory = rs.getString("size_category");
usageCategory = rs.getString("usage_category");
//doc = new Document(new Field());
doc = new Document();
doc.add(new Field("product_id",productId,Store.YES,Index.NO));
doc.add(new Field("product_name",productName,Store.YES,Index.NO));
doc.add(new Field("brand_id",brandId,Store.YES,Index.NO));
doc.add(new Field("brand_name",brandName,Store.YES,Index.NO));
doc.add(new Field("price",price,Store.YES,Index.NO));
doc.add(new Field("screen_type",screenType,Store.YES,Index.NO));
doc.add(new Field("size_category",sizeCategory,Store.YES,Index.NO));
doc.add(new Field("usage_category",usageCategory,Store.YES,Index.NO));
indexWriter = new IndexWriter(storeIndexDirectory, indexWriterConfig);
indexWriter.addDocument(doc);
indexWriter.close();
doc = null;
}
Code snippet for search:
String queryString = arg[0];
Directory storeIndexDirectory = FSDirectory.open(new File("E:\\index_dir"));
IndexReader indexReader = IndexReader.open(storeIndexDirectory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
QueryParser parser = new QueryParser(Version.LUCENE_40,"product_id",new StandardAnalyzer(Version.LUCENE_40));
Query query = parser.parse(queryString);
TopDocs topDocs = indexSearcher.search(query,1000);
ScoreDoc[] hits = topDocs.scoreDocs;
System.out.println(hits.length);
for(int i=0;i < hits.length; i++)
{
int docId = hits[i].doc;
Document d = indexSearcher.doc(docId);
System.out.println(d.get("product_id") + "," + d.get("product_name") + "," + d.get("brand_id") + "," + d.get("brand_name") + "," + d.get("price") + "," + d.get("screen_type") + "," + d.get("size_category") + "," + d.get("usage_category"));
}
I am not able to locate the error in search or indexing part.
With Lucene, a field is searchable only if it is indexed.
In your example, every new Field(...) statement passes Index.NO, so the values are stored but never indexed.
Change it to Index.ANALYZED (or Index.NOT_ANALYZED for exact, untokenized matching) only for the fields you want to search.
You can also use TextField, which is indexed and tokenized, instead of the generic Field.
Issue is resolved now. I used Index.ANALYZED instead of Index.NO when creating the fields added to the document, as SRS suggested.
This raises a new question for me: in Lucene, I have to mark a field as indexed (for example Index.ANALYZED) to make it searchable. So when would someone want to create a field that is not searchable? We store documents and fields in Lucene in order to search them, so in which use case would we use Index.NO? Thanks.
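The difference between a stored value and an indexed value can be illustrated with a toy model in plain Java. This is not Lucene code: StoredVsIndexedSketch and its methods are hypothetical, and the sample values are made up. The sketch keeps every field for retrieval but builds search postings only for the fields named as indexed, which mirrors why a field added with Index.NO comes back from d.get(...) yet never matches a query. A stored-only field is typically one that is only displayed with the results, so there is no need to make it searchable.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class StoredVsIndexedSketch {

    // Stored side: every field value of every document, retrievable by document id.
    private final List<Map<String, String>> storedFields = new ArrayList<>();
    // Indexed side: "field:term" mapped to the ids of the documents containing that term.
    private final Map<String, List<Integer>> postings = new HashMap<>();

    /** Stores all fields, but only tokenizes and indexes the fields named in indexedFieldNames. */
    int addDocument(Map<String, String> fields, Set<String> indexedFieldNames) {
        int docId = storedFields.size();
        storedFields.add(new HashMap<>(fields));
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (!indexedFieldNames.contains(field.getKey())) {
                continue; // comparable to Index.NO: stored, but not searchable
            }
            for (String token : field.getValue().toLowerCase(Locale.ROOT).split("\\s+")) {
                List<Integer> docs =
                        postings.computeIfAbsent(field.getKey() + ":" + token, k -> new ArrayList<>());
                if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                    docs.add(docId);
                }
            }
        }
        return docId;
    }

    /** Looks the lowercased term up in the postings of the given field. */
    List<Integer> search(String field, String term) {
        String key = field + ":" + term.toLowerCase(Locale.ROOT);
        return new ArrayList<>(postings.getOrDefault(key, Collections.emptyList()));
    }

    /** Returns the stored value, whether or not the field was indexed. */
    String storedValue(int docId, String field) {
        return storedFields.get(docId).get(field);
    }

    public static void main(String[] args) {
        StoredVsIndexedSketch index = new StoredVsIndexedSketch();
        Map<String, String> fields = new HashMap<>();
        fields.put("product_id", "No001");        // made-up value
        fields.put("product_name", "Acme Phone"); // made-up value
        int docId = index.addDocument(fields, Collections.singleton("product_name"));

        System.out.println(index.search("product_id", "No001"));   // prints []
        System.out.println(index.search("product_name", "acme"));  // prints [0]
        System.out.println(index.storedValue(docId, "product_id")); // prints No001
    }
}
```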

Prefix search using lucene

I am trying to implement autocomplete using Lucene's search functionality. The following code searches by the query prefix, but it also returns every sentence that contains a word with that prefix, while I want only the sentences or words that start with the prefix.
ex: m
--holiday mansion houseboat
--eye muscles
--movies of all time
--machine
I want it to show only the last 2 entries. I am stuck here, and I am new to Lucene. Can anyone help me with this? Thanks in advance.
addDoc(IndexWriter w, String title, String isbn) throws IOException {
Document doc = new Document();
doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
// use a string field for isbn because we don't want it tokenized
doc.add(new Field("isbn", isbn, Field.Store.YES, Field.Index.ANALYZED));
w.addDocument(doc);
}
Main:
try {
// 0. Specify the analyzer for tokenizing text.
// The same analyzer should be used for indexing and searching
StandardAnalyzer analyzer = new StandardAnalyzer();
// 1. create the index
Directory index = FSDirectory.open(new File(indexDir));
IndexWriter writer = new IndexWriter(index, new StandardAnalyzer(Version.LUCENE_30), true, IndexWriter.MaxFieldLength.UNLIMITED); //3
for (int i = 0; i < source.size(); i++) {
addDoc(writer, source.get(i), + (i + 1) + "z");
}
writer.close();
// 2. query
Term term = new Term("title", querystr);
//create the term query object
PrefixQuery query = new PrefixQuery(term);
// 3. search
int hitsPerPage = 20;
IndexReader reader = IndexReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// 4. Get results
for (int i = 0; i < hits.length; ++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println(d.get("title"));
}
reader.close();
} catch (Exception e) {
System.out.println("Exception (LuceneAlgo.getSimilarString()) : " + e);
}
}
}
I see two solutions:
As suggested by Yahnoosh, save the title field twice: once as a TextField (analyzed) and once as a StringField (not analyzed).
Save it only as a TextField, but use SpanFirstQuery when querying:
// 2. query
Term term = new Term("title", querystr);
//create the term query object
PrefixQuery pq = new PrefixQuery(term);
SpanQuery wrapper = new SpanMultiTermQueryWrapper<PrefixQuery>(pq);
// "final" is a reserved word in Java, so the variable needs a different name
Query finalQuery = new SpanFirstQuery(wrapper, 1);
If I understand your scenario correctly, you want to autocomplete on the title field.
The solution is to have two fields: one analyzed, to enable querying over it, one non-analyzed to have titles indexed without breaking them into individual terms.
Your autocomplete logic should issue prefix queries against the non-analyzed field to match only on the first word. Your term queries should be issued against the analyzed field for matches within the title.
I hope that makes sense.
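The difference between the two fields can be sketched in plain Java with the titles from the question. This is only an illustration of the matching behaviour, not Lucene code; the class and method names are hypothetical, and lowercasing both sides is an assumption made to keep the comparison simple.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class PrefixMatchSketch {

    /** Mimics a prefix query on an analyzed field: any token of the title may start with the prefix. */
    static List<String> matchAnyToken(List<String> titles, String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        for (String title : titles) {
            for (String token : title.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (token.startsWith(p)) {
                    matches.add(title);
                    break;
                }
            }
        }
        return matches;
    }

    /** Mimics a prefix query on a non-analyzed field: the whole title must start with the prefix. */
    static List<String> matchWholeTitle(List<String> titles, String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        for (String title : titles) {
            if (title.toLowerCase(Locale.ROOT).startsWith(p)) {
                matches.add(title);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
                "holiday mansion houseboat", "eye muscles", "movies of all time", "machine");
        System.out.println(matchAnyToken(titles, "m").size()); // prints 4
        System.out.println(matchWholeTitle(titles, "m"));      // prints [movies of all time, machine]
    }
}
```

Only the whole-title variant gives the autocomplete behaviour asked for, which is why the prefix query should go against the non-analyzed field.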

Lucene: prefix query not working with WhitespaceAnalyzer

I'm experimenting a little with Lucene's various Query objects, and I'm trying to understand why a prefix query doesn't match any documents when a WhitespaceAnalyzer is used for indexing. Consider the following test code:
protected String[] ids = { "1", "2" };
protected String[] unindexed = { "Netherlands", "Italy" };
protected String[] unstored = { "Amsterdam has lots of bridges",
"Venice has lots of canals" };
protected String[] text = { "Amsterdam", "Venice" };
@Test
public void testWhitespaceAnalyzerPrefixQuery() throws IOException, ParseException {
File indexes = new File(
"C:/LuceneInActionTutorial/indexes");
FSDirectory dir = FSDirectory.open(indexes);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_9,
new LimitTokenCountAnalyzer(new WhitespaceAnalyzer(
Version.LUCENE_4_9), Integer.MAX_VALUE));
IndexWriter writer = new IndexWriter(dir, config);
for (int i = 0; i < ids.length; i++) {
Document doc = new Document();
doc.add(new StringField("id", ids[i], Store.NO));
doc.add(new StoredField("country", unindexed[i]));
doc.add(new TextField("contents", unstored[i], Store.NO));
doc.add(new Field("city", text[i], TextField.TYPE_STORED));
writer.addDocument(doc);
}
writer.close();
DirectoryReader dr = DirectoryReader.open(dir);
IndexSearcher is = new IndexSearcher(dr);
QueryParser queryParser = new QueryParser(Version.LUCENE_4_9,
"contents", new WhitespaceAnalyzer(Version.LUCENE_4_9));
queryParser.setLowercaseExpandedTerms(true);
Query q = queryParser.parse("Ven*");
assertTrue(q.getClass().getSimpleName().contains("PrefixQuery"));
TopDocs hits = is.search(q, 10);
assertEquals(1, hits.totalHits);
}
If I replace the WhitespaceAnalyzer with the StandardAnalyzer, the test passes. I used Luke to inspect the index content, but couldn't find any differences in how Lucene stores the values during indexing. Could anybody please clarify what's going wrong?
StandardAnalyzer lowercases text when it is indexed; WhitespaceAnalyzer does not. With WhitespaceAnalyzer, the term in the index is "Venice".
The query parser lowercases your query, though, because you called setLowercaseExpandedTerms(true) (this is also the default; to disable it you have to set it to false explicitly). So your query becomes "ven*", which does not match "Venice".
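The mismatch can be reproduced without Lucene in a few lines of plain Java. The sketch below only imitates the two behaviours described above (splitting on whitespace with and without lowercasing); the class and method names are hypothetical, and real analyzers do more than this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class AnalyzerCaseSketch {

    /** Imitates WhitespaceAnalyzer: split on whitespace, keep the original case. */
    static List<String> whitespaceTerms(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    /** Imitates only the lowercasing aspect of StandardAnalyzer. */
    static List<String> lowercasedTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : whitespaceTerms(text)) {
            terms.add(token.toLowerCase(Locale.ROOT));
        }
        return terms;
    }

    /** A prefix query matches when some indexed term starts with the prefix, case-sensitively. */
    static boolean prefixMatches(List<String> indexedTerms, String prefix) {
        for (String term : indexedTerms) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String contents = "Venice has lots of canals";
        // The parser lowercases the expanded term, so "Ven*" is searched as "ven".
        String lowercasedPrefix = "Ven".toLowerCase(Locale.ROOT);

        System.out.println(prefixMatches(whitespaceTerms(contents), lowercasedPrefix)); // prints false
        System.out.println(prefixMatches(lowercasedTerms(contents), lowercasedPrefix)); // prints true
        System.out.println(prefixMatches(whitespaceTerms(contents), "Ven"));            // prints true
    }
}
```

The last line corresponds to leaving the prefix in its original case, which is what calling setLowercaseExpandedTerms(false) would do.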

How to get distinct value from Lucene Field

I am trying to search a Lucene index that I created using StandardAnalyzer.
The data to be indexed looks like the following:
course
BCA
MCA
BCA
BCA
MCA
When I search on course = "BCA", it returns BCA 3 times, but I want distinct values, i.e. BCA only once.
I am using the following code
try {
File indexDir = new File("D:\\indexdirectory\\");
Directory directory = FSDirectory.open(indexDir);
IndexSearcher searcher = new IndexSearcher(directory, true);
QueryParser parser1;
parser1= new QueryParser(Version.LUCENE_36, "course", new StandardAnalyzer(Version.LUCENE_36));
Query query = parser1.parse("BCA");
int maxhits = 5000;
TopDocs topDocs = searcher.search(query, maxhits);
ScoreDoc[] hits = topDocs.scoreDocs;
int len = hits.length;
int docId;
Document d;
for(int j=0;j<len;j++) {
docId = hits[j].doc;
d = searcher.doc(docId);
String c = d.get("course");
System.out.println("Course = "+c);
}
}catch(Exception e) {
System.out.println("Exception occured"+e);
}
It returns BCA 3 times, not once as expected.
