how to refine the search using apache lucene index - java

I am searching a keyword using index created by apache lucene , it returns the name of files which contains the given keyword now i want to refine the search again only in the files returned by lucene search . How is it possible to refine the search using apache lucene.
I am using the following code.
try
{
File indexDir=new File("path upto the index directory");
Directory directory = FSDirectory.open(indexDir);
IndexSearcher searcher = new IndexSearcher(directory, true);
QueryParser parser = new QueryParser(Version.LUCENE_36, "contents", new SimpleAnalyzer(Version.LUCENE_36));
Query query = parser.parse(qu);
query.setBoost((float) 1.5);
TopDocs topDocs = searcher.search(query, maxhits);
ScoreDoc[] hits = topDocs.scoreDocs;
len = hits.length;
int docId = 0;
Document d;
for ( i = 0; i<len; i++) {
docId = hits[i].doc;
d = searcher.doc(docId);
filename= d.get(("filename"));
}
}
catch(Exception ex){ex.printStackTrace();}
I have added documents in the lucene index using as contents and filename.

You want to use a BooleanQuery for something like this. That will let you AND the original search constraints with the refined search constraints.
Example:
BooleanQuery query = new BooleanQuery();
Query origSearch = getOrigSearch(searchString);
Query refinement = makeRefinement();
query.add(origSearch, Occur.MUST);
query.add(refinement, Occur.MUST);
TopDocs topDocs = searcher.search(query, maxHits);

Related

Prefix search using lucene

I am trying to do autocomplete using lucene search functionality. I have the following code which searches by the query prefix but along with that it also gives me all the sentences containing that word while I want it to display only sentence or word starting exactly with that prefix.
ex: m
--holiday mansion houseboat
--eye muscles
--movies of all time
--machine
I want it to show only last 2 queries. How to do it am stucked here also I am new to lucene. Please can any one help me in this. Thanks in advance.
addDoc(IndexWriter w, String title, String isbn) throws IOException {
Document doc = new Document();
doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
// use a string field for isbn because we don't want it tokenized
doc.add(new Field("isbn", isbn, Field.Store.YES, Field.Index.ANALYZED));
w.addDocument(doc);
}
Main:
try {
// 0. Specify the analyzer for tokenizing text.
// The same analyzer should be used for indexing and searching
StandardAnalyzer analyzer = new StandardAnalyzer();
// 1. create the index
Directory index = FSDirectory.open(new File(indexDir));
IndexWriter writer = new IndexWriter(index, new StandardAnalyzer(Version.LUCENE_30), true, IndexWriter.MaxFieldLength.UNLIMITED); //3
for (int i = 0; i < source.size(); i++) {
addDoc(writer, source.get(i), + (i + 1) + "z");
}
writer.close();
// 2. query
Term term = new Term("title", querystr);
//create the term query object
PrefixQuery query = new PrefixQuery(term);
// 3. search
int hitsPerPage = 20;
IndexReader reader = IndexReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// 4. Get results
for (int i = 0; i < hits.length; ++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println(d.get("title"));
}
reader.close();
} catch (Exception e) {
System.out.println("Exception (LuceneAlgo.getSimilarString()) : " + e);
}
}
}
I see two solutions:
as suggested by Yahnoosh, save the title field twice, Once as TextField (=analyzed) and once as StringField (not analyzed)
save it just as TextField, but When Querying use SpanFirstQuery
// 2. query
Term term = new Term("title", querystr);
//create the term query object
PrefixQuery pq = new PrefixQuery(term);
SpanQuery wrapper = new SpanMultiTermQueryWrapper<PrefixQuery>(pq);
Query final = new SpanFirstQuery(wrapper, 1);
If I understand your scenario correctly, you want to autocomplete on the title field.
The solution is to have two fields: one analyzed, to enable querying over it, one non-analyzed to have titles indexed without breaking them into individual terms.
Your autocomplete logic should issue prefix queries against the non-analyzed field to match only on the first word. Your term queries should be issued against the analyzed field for matches within the title.
I hope that makes sense.

Deleting a document in apache lucene having exact match

I want to delete a document in apache lucene having exact match only. for example I have documents containing text:
Document1: Bilal
Document2: Bilal Ahmed
Doucument3: Bilal Ahmed - 54
And when Try to remove the document with query 'Bilal' it deletes all these three documents while it should delete just first document with exact match.
The Code I use is this:
String query = "bilal";
String field = "userNames";
Term term = new Term(field, query);
IndexWriter indexWriter = null;
File indexDir = new File(idexedDirectory);
Directory directory = FSDirectory.open(indexDir);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46, analyzer);
indexWriter = new IndexWriter(directory, iwc);
indexWriter.deleteDocuments(term);
indexWriter.close();
This is how I am indexing my documents:
File indexDir = new File("C:\\Local DB\\TextFiled");
Directory directory = FSDirectory.open(indexDir);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46, analyzer);
//Thirdly We tell the Index Writer that which document to index
indexWriter = new IndexWriter(directory, iwc);
int i = 0;
try (DataSource db = DataSource.getInstance()) {
PreparedStatement ps = db.getPreparedStatement(
"SELECT user_id, username FROM " + TABLE_NAME + " as au" + User_CONDITION);
try (ResultSet resultSet = ps.executeQuery()) {
while (resultSet.next()) {
i++;
doc = new Document();
text = resultSet.getString("username");
doc.add(new StringField("userNames", text, Field.Store.YES));
indexWriter.addDocument(doc);
System.out.println("User Name : " + text + " : " + userID);
}
}
You have missed to provide how you index those documents. If they are indexed using StandardAnalyzer and tokenization is on, it is understandable that you get these results - this is because StandardAnalyzer tokenizes the text for each word and since each of your documents contains Bilal, you hit all those documents as a result.
The general advice is that you should always add a unique id field and query/delete by this id field.
If you can't do this - index the same text as a separate field - without tokenization - and use phrase query to find the exact match, but this sounds like a horrible hack to me.

search with in date range lucene and AND operator

I want to make a query which will give me data between date range and also by one more AND condition in lucene 3.0.1. This is the code for query between two dates :
IndexSearcher searcher = new IndexSearcher(directory);
String lowerDate = "2013-06-27";
String upperDate = "2013-06-29";
boolean includeLower = true;
boolean includeUpper = true;
TermRangeQuery query = new TermRangeQuery("created_at",lowerDate, upperDate, includeLower, includeUpper);
// display search results
TopDocs topDocs = searcher.search(query, 10);
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
Document doc = searcher.doc(scoreDoc.doc);
System.out.println(doc.get("id"));
}
I have one more indexed column text, how can I include one more AND condition with this query, I am trying to get results within date range which also contain some keyword in test column.
You need to use a BooleanQuery, like:
TermRangeQuery dateQuery = new TermRangeQuery("created_at",lowerDate, upperDate, includeLower, includeUpper);
TermQuery keywordQuery = new TermQuery(new Term("keyword", "term"));
BooleanQuery bq = new BooleanQuery();
bq.add(new BooleanClause(dateQuery, BooleanClause.Occur.MUST))
bq.add(new BooleanClause(keywordQuery, BooleanClause.Occur.MUST))
// display search results
TopDocs topDocs = searcher.search(bq, 10);
Combining the two clauses, each with BooleanClause.Occur.MUST, is equivalent to an "AND" (take a look at the descriptions of the "MUST", "SHOULD" and "MUST_NOT" in the BooleanClause.Occur documentation to better understand your options with Lucene's "boolean" logic).

How to get distinct value from Lucene Field

I am trying to make a search from Lucene index. I have created an index using StandardAnalyzer
I have the data to be indexed like following
course
BCA
MCA
BCA
BCA
MCA
When i search on course ="BCA" it returns me 3 times BCA but i want it should give the distinct values ie only once
I am using the following code
try {
File indexDir = new File("D:\\indexdirectory\\");
Directory directory = FSDirectory.open(indexDir);
IndexSearcher searcher = new IndexSearcher(directory, true);
QueryParser parser1;
parser1= new QueryParser(Version.LUCENE_36, "course", new StandardAnalyzer(Version.LUCENE_36));
Query query = parser1.parse("BCA");
int maxhits = 5000;
TopDocs topDocs = searcher.search(query, maxhits);
ScoreDoc[] hits = topDocs.scoreDocs;
int len = hits.length;
int docId;
Document d;
for(int j=0;j<len;j++) {
docId = hits[j].doc;
d = searcher.doc(docId);
String c = d.get("course");
System.out.println("Course = "+c);
}
}catch(Exception e) {
System.out.println("Exception occured"+e);
}
it returns BCA 3 times not only once as expected.

search from apache lucene index and count the result group wise

I am trying to search from lucene index but i want to filter this search . there are two fields contents and and category .suppose i want to search in files which have "sports" and i also want to count to count how much files are in a and b category . I am trying to achive this with following code . But problem is that if there are millions of the records then it goes slow due to loop execution, suggest me another way to achieve the task.
try { File indexDir= new File("path of the file")
Directory directory = FSDirectory.open(indexDir);
IndexSearcher searcher = new IndexSearcher(directory, true);
int maxhits=1000000;
QueryParser parser1 = new QueryParser(Version.LUCENE_36, "contents",
new StandardAnalyzer(Version.LUCENE_36));
Query qu=parser1.parse("sport");
TopDocs topDocs = searcher.search(, maxhits);
ScoreDoc[] hits = topDocs.scoreDocs;
len = hits.length;
JOptionPane.showMessageDialog(null,"found times"+len);
int docId = 0;
Document d;
String category="";
int ctr=0,ctr1=0;
for ( i = 0; i<len; i++) {
docId = hits[i].doc;
d = searcher.doc(docId);
category= d.get(("category"));
if(category.equals("a"))
ctr++;
if(category.equals("b"))
ctr1++;
}
JOptionPane.showMessageDialog("wprd found in category a times"+ctr);
JOptionPane.showMessageDialog("wprd found in category b times"+ctr1);
}
catch(Exception ex)
{
ex.printStackTrace();
}
You could just query for each category you are looking for and get totalHits. Better still would be to use a TotalHitCountCollector, instead of getting a TopDocs instance:
Query query = parser1.parser("+sport +category:a")
TotalHitCountCollector collector = new TotalHitCountCollector();
search.search(query, collector);
ctr = collector.getTotalHits();
query = parser1.parser("+sport +category:b")
collector = new TotalHitCountCollector();
search.search(query, collector);
ctr1 = collector.getTotalHits();

Categories