How to get distinct value from Lucene Field - java

I am trying to search a Lucene index that I created using StandardAnalyzer.
The indexed data looks like this:
course
BCA
MCA
BCA
BCA
MCA
When I search for course = "BCA", it returns BCA three times, but I want only the distinct value, i.e. BCA once.
I am using the following code:
try {
File indexDir = new File("D:\\indexdirectory\\");
Directory directory = FSDirectory.open(indexDir);
IndexSearcher searcher = new IndexSearcher(directory, true);
QueryParser parser1;
parser1= new QueryParser(Version.LUCENE_36, "course", new StandardAnalyzer(Version.LUCENE_36));
Query query = parser1.parse("BCA");
int maxhits = 5000;
TopDocs topDocs = searcher.search(query, maxhits);
ScoreDoc[] hits = topDocs.scoreDocs;
int len = hits.length;
int docId;
Document d;
for(int j=0;j<len;j++) {
docId = hits[j].doc;
d = searcher.doc(docId);
String c = d.get("course");
System.out.println("Course = "+c);
}
}catch(Exception e) {
System.out.println("Exception occurred: " + e);
}
It returns BCA 3 times, not once as expected.
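A search returns one hit per matching document, so three documents whose course is BCA produce three hits. One way to report each value once is to collect the stored values into a set while iterating over the hits. The sketch below shows only that de-duplication step in plain Java. The class name DistinctCourses, the method distinct, and the hard-coded list standing in for the values read with d.get("course") are assumptions for illustration, not part of the Lucene API.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class DistinctCourses {

    // Returns the values in first-seen order with duplicates removed.
    static List<String> distinct(List<String> storedValues) {
        Set<String> seen = new LinkedHashSet<>(storedValues);
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Stand-in for the "course" values of the three hits for the query "BCA".
        List<String> fromHits = List.of("BCA", "BCA", "BCA");
        for (String course : distinct(fromHits)) {
            System.out.println("Course = " + course);
        }
    }
}
```

In the loop from the question, the same effect could be had by adding each value of c to a LinkedHashSet and printing the set after the loop.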

Related

Lucene Search with Date parameter

I am fairly new to the Lucene framework. We are trying to adopt Lucene because we need to search a LARGE amount of data within a few milliseconds.
Scenario:
We have EmployeeDto which we have indexed in Lucene. For below
example, I have hardcoded only 6 values.
I have 2 arguments which should act as input parameters to the search
query.
EmployeeDto.java
private String firstName;
private String lastName;
private Long employeeId;
private Integer salary;
private Date startDate;
private Date terminationDate;
//getters and setters
EmployeeLucene.java
public class EmployeeLucene {
public static void main(String[] args) throws IOException, ParseException {
// 0. Specify the analyzer for tokenizing text.
// The same analyzer should be used for indexing and searching
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
final DateFormat DATE_FORMAT = new SimpleDateFormat("yyyy-MM-dd");
// 1. create the index
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
IndexWriter w = new IndexWriter(index, config);
long starttimeOfLoad = Calendar.getInstance().getTimeInMillis();
System.out.println("Data Loading started");
addEmployee(w, new EmployeeDto("John", "Smith", new Long(101), 10000, DATE_FORMAT.parse("2010-05-05"), DATE_FORMAT.parse("2018-05-05")));
addEmployee(w, new EmployeeDto("Bill", "Thomas", new Long(102), 12000, DATE_FORMAT.parse("2011-06-06"), DATE_FORMAT.parse("2015-03-10")));
addEmployee(w, new EmployeeDto("Franklin", "Robinson", new Long(102), 12000, DATE_FORMAT.parse("2011-04-04"), DATE_FORMAT.parse("2015-07-07")));
addEmployee(w, new EmployeeDto("Thomas", "Boone", new Long(102), 12000, DATE_FORMAT.parse("2011-02-02"), DATE_FORMAT.parse("2015-03-10")));
addEmployee(w, new EmployeeDto("John", "Smith", new Long(103), 13000, DATE_FORMAT.parse("2019-05-05"), DATE_FORMAT.parse("2099-12-31")));
addEmployee(w, new EmployeeDto("Bill", "Thomas", new Long(102), 14000, DATE_FORMAT.parse("2011-06-06"), DATE_FORMAT.parse("2099-12-31")));
w.close();
System.out.println("Data Loaded. Completed in " + (Calendar.getInstance().getTimeInMillis() - starttimeOfLoad));
// 2. query
Query q = null;
try {
q = new QueryParser(Version.LUCENE_40, "fullName", analyzer).parse(args[0] + "*");
} catch (org.apache.lucene.queryparser.classic.ParseException e) {
e.printStackTrace();
}
// 3. search
long starttime = Calendar.getInstance().getTimeInMillis();
int hitsPerPage = 100;
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// 4. display results
System.out.println("Found " + hits.length + " hits.");
List<EmployeeDto> employeeDtoList = new ArrayList<EmployeeDto>();
for (int i = 0; i < hits.length; ++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
employeeDtoList.add(new EmployeeDto(d.get("firstName"), d.get("lastName"), Long.valueOf(d.get("employeeId")),
Integer.valueOf(d.get("salary"))));
}
System.out.println(employeeDtoList.size());
System.out.println(employeeDtoList);
System.out.println("Time taken:" + (Calendar.getInstance().getTimeInMillis() - starttime) + " ms");
}
private static void addEmployee(IndexWriter w, EmployeeDto employeeDto) throws IOException, ParseException {
Document doc = new Document();
doc.add(new TextField("fullName", employeeDto.getFirstName() + " " + employeeDto.getLastName(), Field.Store.YES));
doc.add(new TextField("firstName", employeeDto.getFirstName(), Field.Store.YES));
doc.add(new TextField("lastName", employeeDto.getLastName(), Field.Store.YES));
doc.add(new LongField("employeeId", employeeDto.getEmployeeId(), Field.Store.YES));
doc.add(new LongField("salary", employeeDto.getSalary(), Field.Store.YES));
doc.add(new LongField("startDate", employeeDto.getStartDate().getTime(), Field.Store.YES));
doc.add(new LongField("terminationDate", employeeDto.getTerminationDate().getTime(), Field.Store.YES));
w.addDocument(doc);
}
}
I run the program as "java EmployeeLucene thom 2014-05-05".
I expect only 2 results, but I am getting 3 hits.
Questions:
How do I include the 2nd parameter in the query? The 2nd parameter should be greater than 'startDate' and less than 'terminationDate'.
Can we include the EmployeeDto itself inside the document, to avoid building the list of EmployeeDtos once we get the hits?
First, you're going to get three results because three records have a full name containing a term that starts with "thom" (the query is "thom*"). They are records 2, 4, and 6.
Second, Lucene version 4.0 is really old.
Finally, one way to query for a date between startDate and terminationDate is as follows:
// 2. query
BooleanQuery finalQuery = null;
try {
// final query
finalQuery = new BooleanQuery();
// thom* query
Query fullName = new QueryParser(Version.LUCENE_40, "fullName", analyzer).parse("thom" + "*");
finalQuery.add(fullName, Occur.MUST); // MUST implies that the keyword must occur.
// greaterStartDate query
long searchDate = DATE_FORMAT.parse("2014-05-05").getTime();
Query greaterStartDate = NumericRangeQuery.newLongRange("startDate", null, searchDate, true, true);
finalQuery.add(greaterStartDate, Occur.MUST); // Using all "MUST" occurs is equivalent to "AND" operator
// lessTerminationDate query
Query lessTerminationDate = NumericRangeQuery.newLongRange("terminationDate", searchDate, null, false, false);
finalQuery.add(lessTerminationDate, Occur.MUST);
} catch (org.apache.lucene.queryparser.classic.ParseException e) {
e.printStackTrace();
}
Can we include EmployeeDto itself inside the document to avoid creation of List of EmployeeDtos once we get the hits.
Not that I'm aware of.
EDIT: Version 7.0.1
// 0. Specify the analyzer for tokenizing text.
// The same analyzer should be used for indexing and searching
StandardAnalyzer analyzer = new StandardAnalyzer();
final DateFormat DATE_FORMAT = new SimpleDateFormat("yyyy-MM-dd");
// 1. create the index
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter w = new IndexWriter(index, config);
long starttimeOfLoad = Calendar.getInstance().getTimeInMillis();
System.out.println("Data Loading started");
addEmployee(w, new EmployeeDto("John", "Smith", new Long(101), 10000, DATE_FORMAT.parse("2010-05-05"), DATE_FORMAT.parse("2018-05-05")));
addEmployee(w, new EmployeeDto("Bill", "Thomas", new Long(102), 12000, DATE_FORMAT.parse("2011-06-06"), DATE_FORMAT.parse("2015-10-10")));
addEmployee(w, new EmployeeDto("Franklin", "Robinson", new Long(102), 12000, DATE_FORMAT.parse("2011-04-04"), DATE_FORMAT.parse("2015-07-07")));
addEmployee(w, new EmployeeDto("Thomas", "Boone", new Long(102), 12000, DATE_FORMAT.parse("2011-02-02"), DATE_FORMAT.parse("2015-03-10")));
addEmployee(w, new EmployeeDto("John", "Smith", new Long(103), 13000, DATE_FORMAT.parse("2019-05-05"), DATE_FORMAT.parse("2099-12-31")));
addEmployee(w, new EmployeeDto("Bill", "Thomas", new Long(102), 14000, DATE_FORMAT.parse("2011-06-06"), DATE_FORMAT.parse("2099-12-31")));
w.close();
System.out.println("Data Loaded. Completed in " + (Calendar.getInstance().getTimeInMillis() - starttimeOfLoad));
// 2. query
BooleanQuery finalQuery = null;
try {
// final query
Builder builder = new BooleanQuery.Builder();
// thom* query
Query fullName = new QueryParser("fullName", analyzer).parse("thom" + "*");
builder.add(fullName, Occur.MUST); // MUST implies that the keyword must occur.
// greaterStartDate query
long searchDate = DATE_FORMAT.parse("2014-05-05").getTime();
Query greaterStartDate = LongPoint.newRangeQuery("startDatePoint", Long.MIN_VALUE, searchDate);
builder.add(greaterStartDate, Occur.MUST); // Using all "MUST" occurs is equivalent to "AND" operator
// lessTerminationDate query
Query lessTerminationDate = LongPoint.newRangeQuery("terminationDatePoint", searchDate, Long.MAX_VALUE);
builder.add(lessTerminationDate, Occur.MUST);
finalQuery = builder.build();
} catch (org.apache.lucene.queryparser.classic.ParseException e) {
e.printStackTrace();
}
// 3. search
long starttime = Calendar.getInstance().getTimeInMillis();
int hitsPerPage = 100;
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage);
searcher.search(finalQuery, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// 4. display results
System.out.println("Found " + hits.length + " hits.");
List<EmployeeDto> employeeDtoList = new ArrayList<EmployeeDto>();
for (int i = 0; i < hits.length; ++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
employeeDtoList.add(new EmployeeDto(d.get("firstName"), d.get("lastName"), Long.valueOf(d.get("employeeId")),
Integer.valueOf(d.get("salary"))));
}
System.out.println(employeeDtoList.size());
System.out.println(employeeDtoList);
System.out.println("Time taken:" + (Calendar.getInstance().getTimeInMillis() - starttime) + " ms");
}
private static void addEmployee(IndexWriter w, EmployeeDto employeeDto) throws IOException {
Document doc = new Document();
doc.add(new TextField("fullName", employeeDto.getFirstName() + " " + employeeDto.getLastName(), Store.YES));
doc.add(new TextField("firstName", employeeDto.getFirstName(), Store.YES));
doc.add(new TextField("lastName", employeeDto.getLastName(), Store.YES));
doc.add(new StoredField("employeeId", employeeDto.getEmployeeId()));
doc.add(new StoredField("salary", employeeDto.getSalary()));
doc.add(new StoredField("startDate", employeeDto.getStartDate().getTime()));
doc.add(new LongPoint("startDatePoint", employeeDto.getStartDate().getTime()));
doc.add(new StoredField("terminationDate", employeeDto.getTerminationDate().getTime()));
doc.add(new LongPoint("terminationDatePoint", employeeDto.getTerminationDate().getTime()));
w.addDocument(doc);
}
EDIT: The date fields are indexed as both LongPoint and StoredField types. A LongPoint field can be used with LongPoint.newRangeQuery, but its value cannot be retrieved later if you want to know what the date is. A StoredField can be retrieved as a stored value but cannot be used for range queries. This example does not retrieve the date fields, but the version 4 code supported both retrieval and range queries, so both field types are kept here. You could remove the StoredField dates if you never plan to retrieve the values.
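The two range clauses together express "startDate is on or before the search date, and terminationDate is on or after it". LongPoint.newRangeQuery treats both bounds as inclusive, whereas the version 4 query excluded the search date itself on the terminationDate side (minInclusive was false). The sketch below spells out both variants as plain Java predicates. The class ActiveOnDate and its methods are hypothetical names, and it assumes dates are converted at UTC midnight, whereas the original code uses SimpleDateFormat in the default time zone.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;

final class ActiveOnDate {

    // Converts an ISO date (yyyy-MM-dd) to epoch milliseconds at UTC midnight.
    static long toMillis(String isoDate) {
        return LocalDate.parse(isoDate).atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
    }

    // Inclusive on both ends, like the two LongPoint range clauses combined with MUST.
    static boolean activeInclusive(long start, long termination, long searchDate) {
        return start <= searchDate && searchDate <= termination;
    }

    // Exclusive termination bound, like the version 4 query (minInclusive = false).
    static boolean activeExclusiveEnd(long start, long termination, long searchDate) {
        return start <= searchDate && searchDate < termination;
    }

    public static void main(String[] args) {
        long start = toMillis("2011-06-06");
        long end = toMillis("2015-03-10");
        System.out.println(activeInclusive(start, end, toMillis("2014-05-05")));
        System.out.println(activeInclusive(start, end, end));
        System.out.println(activeExclusiveEnd(start, end, end));
    }
}
```

If the exclusive behaviour is wanted with LongPoint, the lower bound of the terminationDatePoint query can be passed as Math.addExact(searchDate, 1).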

Prefix search using lucene

I am trying to implement autocomplete using Lucene's search functionality. The following code searches by the query prefix, but it also returns every sentence that contains a word with that prefix, whereas I want only the sentences or words that start with the prefix.
ex: m
--holiday mansion houseboat
--eye muscles
--movies of all time
--machine
I want it to show only the last 2 results. I am stuck here, and I am new to Lucene. Can anyone help me with this? Thanks in advance.
private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
Document doc = new Document();
doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
// use a string field for isbn because we don't want it tokenized
doc.add(new Field("isbn", isbn, Field.Store.YES, Field.Index.NOT_ANALYZED));
w.addDocument(doc);
}
Main:
try {
// 0. Specify the analyzer for tokenizing text.
// The same analyzer should be used for indexing and searching
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
// 1. create the index
Directory index = FSDirectory.open(new File(indexDir));
IndexWriter writer = new IndexWriter(index, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
for (int i = 0; i < source.size(); i++) {
addDoc(writer, source.get(i), (i + 1) + "z");
}
writer.close();
// 2. query
Term term = new Term("title", querystr);
//create the term query object
PrefixQuery query = new PrefixQuery(term);
// 3. search
int hitsPerPage = 20;
IndexReader reader = IndexReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// 4. Get results
for (int i = 0; i < hits.length; ++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println(d.get("title"));
}
reader.close();
} catch (Exception e) {
System.out.println("Exception (LuceneAlgo.getSimilarString()) : " + e);
}
}
}
I see two solutions:
As suggested by Yahnoosh, index the title field twice: once as a TextField (analyzed) and once as a StringField (not analyzed).
Index it only as a TextField, but query it with a SpanFirstQuery:
// 2. query
Term term = new Term("title", querystr);
//create the term query object
PrefixQuery pq = new PrefixQuery(term);
SpanQuery wrapper = new SpanMultiTermQueryWrapper<PrefixQuery>(pq);
Query finalQuery = new SpanFirstQuery(wrapper, 1); // "final" is a reserved word in Java and cannot be a variable name
If I understand your scenario correctly, you want to autocomplete on the title field.
The solution is to have two fields: one analyzed, to enable querying over it, one non-analyzed to have titles indexed without breaking them into individual terms.
Your autocomplete logic should issue prefix queries against the non-analyzed field to match only on the first word. Your term queries should be issued against the analyzed field for matches within the title.
I hope that makes sense.
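To make the difference between the two fields concrete, here is a rough plain-Java analogy using the titles from the question. It does not use Lucene: anyWordStartsWith imitates a prefix query on an analyzed field, where each word is a separate term, and titleStartsWith imitates a prefix query on a non-analyzed field, where the whole title is a single term. The class and method names are made up for this sketch.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

final class PrefixMatchDemo {

    // Imitates a prefix query on an analyzed field: any word may start with the prefix.
    static List<String> anyWordStartsWith(List<String> titles, String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        return titles.stream()
                .filter(t -> {
                    for (String word : t.toLowerCase(Locale.ROOT).split("\\s+")) {
                        if (word.startsWith(p)) {
                            return true;
                        }
                    }
                    return false;
                })
                .collect(Collectors.toList());
    }

    // Imitates a prefix query on a non-analyzed field: the whole title is one term.
    static List<String> titleStartsWith(List<String> titles, String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        return titles.stream()
                .filter(t -> t.toLowerCase(Locale.ROOT).startsWith(p))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> titles = List.of(
                "holiday mansion houseboat", "eye muscles", "movies of all time", "machine");
        System.out.println(anyWordStartsWith(titles, "m"));
        System.out.println(titleStartsWith(titles, "m"));
    }
}
```

One caveat: a StringField is indexed verbatim, so case-insensitive autocomplete would need the title lowercased before indexing and the prefix lowercased before querying. The sketch simply lowercases both sides.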

Fetch Searched Data/Metadata In Lucene

Hi, I am a Java developer learning Lucene. I have a Java class that indexes a PDF file (lucene_in_action_2nd_edition.pdf) and a search class that searches the index for some text. IndexSearcher gives me a Document, which tells me whether the string exists in the indexed file or not.
Now I want to get the matched data or metadata, i.e. I want to know on which page the string was matched, or a few words of text around the match, etc. How do I do that?
Here is my LuceneSearcher.java class:
public static void main(String[] args) throws Exception {
File indexDir = new File("D:\\index");
String querystr = "Advantages of FastVectorHighlighter";
Query q = new QueryParser(Version.LUCENE_40, "contents",
new StandardAnalyzer(Version.LUCENE_40)).parse(querystr);
int hitsPerPage = 100;
IndexReader reader = DirectoryReader.open(FSDirectory.open(indexDir));
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(
hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
System.out.println("Found " + hits.length + " hits.");
for (int i = 0; i < hits.length; i++) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println((i + 1) + "... " + d.get("filename"));
System.out.println("=====================================================");
System.out.println(d.get("contents"));
}
// reader can only be closed when there
// is no need to access the documents any more.
reader.close();
}
Here d.get("contents") gives the full text of the .pdf file (extracted by Tika), which was stored at indexing time.
I want some information about the matched text, so that I can show it on my web page or highlight it properly (like Google search output). How can I achieve that? Do we need to write some logic ourselves, or does Lucene do it internally?
Any help would be appreciated. Thanks in advance.
The org.apache.lucene.search.highlight package provides this functionality.
For example:
// "query" and "analyzer" are the Query and Analyzer used for the search
SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter();
Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));
for (int i = 0; i < hits.length; i++) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
String text = d.get("contents");
String bestFrag = highlighter.getBestFragment(analyzer, "contents", text);
// output, however you like.
}
You can also get a list of the best fragments from the highlighter, instead of just a single one, if you prefer; see the Highlighter API.
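For readers unsure what a "best fragment" looks like, the following plain-Java sketch produces a similar kind of output by hand: a window of text around the first match, with the match wrapped in the <B> tags that SimpleHTMLFormatter uses by default. This is a deliberately naive stand-in, not the Highlighter's algorithm (it does no analysis or scoring), and the names NaiveFragment, bestFragment, and radius are invented for the illustration.

```java
final class NaiveFragment {

    // Finds the first case-insensitive occurrence of term in text, or -1.
    static int indexOfIgnoreCase(String text, String term) {
        for (int i = 0; i + term.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, term, 0, term.length())) {
                return i;
            }
        }
        return -1;
    }

    // Returns up to `radius` characters on each side of the first match,
    // with the match wrapped in <B>...</B>, or null if the term is absent.
    static String bestFragment(String text, String term, int radius) {
        int at = indexOfIgnoreCase(text, term);
        if (at < 0) {
            return null;
        }
        int end = at + term.length();
        int from = Math.max(0, at - radius);
        int to = Math.min(text.length(), end + radius);
        return text.substring(from, at) + "<B>" + text.substring(at, end) + "</B>"
                + text.substring(end, to);
    }

    public static void main(String[] args) {
        String text = "Advantages of FastVectorHighlighter in Lucene";
        System.out.println(bestFragment(text, "fastvectorhighlighter", 3));
    }
}
```

The real Highlighter works on analyzed tokens and scores fragments against the query, so it is the better choice whenever the Lucene highlight module is available.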

search from apache lucene index and count the result group wise

I am trying to search a Lucene index, but I want to filter the search. There are two fields, contents and category. Suppose I want to search for files that contain "sports", and I also want to count how many of those files are in category a and how many are in category b. I am trying to achieve this with the following code. The problem is that with millions of records it becomes slow because of the loop. Please suggest another way to achieve this.
try {
File indexDir = new File("path of the file");
Directory directory = FSDirectory.open(indexDir);
IndexSearcher searcher = new IndexSearcher(directory, true);
int maxhits=1000000;
QueryParser parser1 = new QueryParser(Version.LUCENE_36, "contents",
new StandardAnalyzer(Version.LUCENE_36));
Query qu=parser1.parse("sport");
TopDocs topDocs = searcher.search(qu, maxhits);
ScoreDoc[] hits = topDocs.scoreDocs;
len = hits.length;
JOptionPane.showMessageDialog(null,"found times"+len);
int docId = 0;
Document d;
String category="";
int ctr=0,ctr1=0;
for ( i = 0; i<len; i++) {
docId = hits[i].doc;
d = searcher.doc(docId);
category= d.get(("category"));
if(category.equals("a"))
ctr++;
if(category.equals("b"))
ctr1++;
}
JOptionPane.showMessageDialog(null, "word found in category a times " + ctr);
JOptionPane.showMessageDialog(null, "word found in category b times " + ctr1);
}
catch(Exception ex)
{
ex.printStackTrace();
}
You could just query for each category you are looking for and get totalHits. Better still would be to use a TotalHitCountCollector, instead of getting a TopDocs instance:
Query query = parser1.parse("+sport +category:a");
TotalHitCountCollector collector = new TotalHitCountCollector();
searcher.search(query, collector);
ctr = collector.getTotalHits();
query = parser1.parse("+sport +category:b");
collector = new TotalHitCountCollector();
searcher.search(query, collector);
ctr1 = collector.getTotalHits();
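The query string "+sport +category:a" has two required clauses, so a document is counted only if its contents match "sport" and its category is "a". As a rough in-memory analogy (plain Java, no Lucene), the count below applies the same two conditions to a small list. The Doc class, the sample documents, and the count method are all hypothetical and only illustrate the condition being counted.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class CategoryCountDemo {

    // Hypothetical stand-in for an indexed document with two fields.
    static final class Doc {
        final String contents;
        final String category;

        Doc(String contents, String category) {
            this.contents = contents;
            this.category = category;
        }
    }

    // Counts documents that contain the word AND belong to the category,
    // the same condition that two required (+) clauses express.
    static long count(List<Doc> docs, String word, String category) {
        return docs.stream()
                .filter(d -> Arrays.asList(d.contents.toLowerCase(Locale.ROOT).split("\\W+"))
                        .contains(word))
                .filter(d -> d.category.equals(category))
                .count();
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(
                new Doc("Sport news today", "a"),
                new Doc("Weather report", "a"),
                new Doc("Local sport results", "b"),
                new Doc("Sport and politics", "b"));
        System.out.println("category a: " + count(docs, "sport", "a"));
        System.out.println("category b: " + count(docs, "sport", "b"));
    }
}
```

The benefit of doing this inside Lucene with a TotalHitCountCollector is that no stored documents have to be loaded just to be counted.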

how to refine the search using apache lucene index

I am searching for a keyword using an index created by Apache Lucene. It returns the names of the files that contain the keyword. Now I want to refine the search, i.e. search again only within the files returned by the first search. How can I refine a search with Apache Lucene?
I am using the following code:
try
{
File indexDir=new File("path upto the index directory");
Directory directory = FSDirectory.open(indexDir);
IndexSearcher searcher = new IndexSearcher(directory, true);
QueryParser parser = new QueryParser(Version.LUCENE_36, "contents", new SimpleAnalyzer(Version.LUCENE_36));
Query query = parser.parse(qu);
query.setBoost((float) 1.5);
TopDocs topDocs = searcher.search(query, maxhits);
ScoreDoc[] hits = topDocs.scoreDocs;
len = hits.length;
int docId = 0;
Document d;
for ( i = 0; i<len; i++) {
docId = hits[i].doc;
d = searcher.doc(docId);
filename= d.get(("filename"));
}
}
catch(Exception ex){ex.printStackTrace();}
I added the documents to the Lucene index with the fields contents and filename.
You want to use a BooleanQuery for something like this. That will let you AND the original search constraints with the refined search constraints.
Example:
BooleanQuery query = new BooleanQuery();
Query origSearch = getOrigSearch(searchString);
Query refinement = makeRefinement();
query.add(origSearch, Occur.MUST);
query.add(refinement, Occur.MUST);
TopDocs topDocs = searcher.search(query, maxHits);
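The idea is that a refined search is just the original condition AND the new condition, evaluated against the whole index again, rather than a second pass over the earlier results. The plain-Java sketch below shows the same composition with Predicate.and. It is only an analogy for the two MUST clauses, and the class name, sample data, and predicates are made up.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

final class RefineSearchDemo {

    // Keeps only the items that satisfy both conditions, like two MUST clauses.
    static List<String> search(List<String> contents,
                               Predicate<String> original,
                               Predicate<String> refinement) {
        return contents.stream()
                .filter(original.and(refinement))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> files = List.of(
                "lucene index tutorial", "lucene query syntax", "sql query tuning");
        Predicate<String> original = text -> text.contains("lucene");
        Predicate<String> refinement = text -> text.contains("query");
        System.out.println(search(files, original, refinement));
    }
}
```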
