How to refine a search using an Apache Lucene index

I am searching for a keyword using an index created by Apache Lucene. The search returns the names of the files that contain the keyword. Now I want to refine the search, that is, search again but only within the files returned by the first Lucene search. How can I refine a search like this with Apache Lucene?
I am using the following code:
try {
    File indexDir = new File("path upto the index directory");
    Directory directory = FSDirectory.open(indexDir);
    IndexSearcher searcher = new IndexSearcher(directory, true);
    QueryParser parser = new QueryParser(Version.LUCENE_36, "contents", new SimpleAnalyzer(Version.LUCENE_36));
    Query query = parser.parse(qu);
    query.setBoost(1.5f);
    TopDocs topDocs = searcher.search(query, maxhits);
    ScoreDoc[] hits = topDocs.scoreDocs;
    len = hits.length;
    int docId = 0;
    Document d;
    for (i = 0; i < len; i++) {
        docId = hits[i].doc;
        d = searcher.doc(docId);
        filename = d.get("filename");
    }
} catch (Exception ex) {
    ex.printStackTrace();
}
I added the documents to the Lucene index with two fields, contents and filename.

You want to use a BooleanQuery for something like this. That will let you AND the original search constraints with the refined search constraints.
Example:
BooleanQuery query = new BooleanQuery();
Query origSearch = getOrigSearch(searchString);
Query refinement = makeRefinement();
query.add(origSearch, Occur.MUST);
query.add(refinement, Occur.MUST);
TopDocs topDocs = searcher.search(query, maxHits);
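To see why two MUST clauses behave like a refinement, here is a minimal plain-Java model. It does not use Lucene at all: it assumes each query can be represented by the list of file names it matches, and the names RefineSketch and refine are hypothetical, chosen only for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Plain-Java model of MUST + MUST semantics; this is not Lucene code.
final class RefineSketch {
    // Keeps only the files matched by both queries,
    // preserving the order of the original search.
    static Set<String> refine(List<String> originalHits, List<String> refinementHits) {
        Set<String> result = new LinkedHashSet<>(originalHits);
        result.retainAll(refinementHits);
        return result;
    }

    public static void main(String[] args) {
        List<String> original = Arrays.asList("a.txt", "b.txt", "c.txt");
        List<String> refinement = Arrays.asList("c.txt", "a.txt", "d.txt");
        System.out.println(refine(original, refinement)); // [a.txt, c.txt]
    }
}
```

With a real BooleanQuery the intersection happens inside the index, so there is no need to keep the file names from the first search around.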

Related

Prefix search using lucene

I am trying to implement autocomplete using Lucene. The code below searches by prefix, but it also returns every sentence that contains a word starting with the prefix, whereas I want only the sentences or words that themselves start with the prefix.
Example, for the prefix m:
--holiday mansion houseboat
--eye muscles
--movies of all time
--machine
I want it to show only the last two entries. How can I do this? I am stuck here and I am new to Lucene. Any help would be appreciated. Thanks in advance.
private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
    Document doc = new Document();
    doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
    // use a non-analyzed field for isbn because we don't want it tokenized
    doc.add(new Field("isbn", isbn, Field.Store.YES, Field.Index.NOT_ANALYZED));
    w.addDocument(doc);
}
Main:
try {
    // 0. Specify the analyzer for tokenizing text.
    //    The same analyzer should be used for indexing and searching.
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
    // 1. Create the index
    Directory index = FSDirectory.open(new File(indexDir));
    IndexWriter writer = new IndexWriter(index, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
    for (int i = 0; i < source.size(); i++) {
        addDoc(writer, source.get(i), (i + 1) + "z");
    }
    writer.close();
    // 2. Query
    Term term = new Term("title", querystr);
    // create the prefix query object
    PrefixQuery query = new PrefixQuery(term);
    // 3. Search
    int hitsPerPage = 20;
    IndexReader reader = IndexReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
    searcher.search(query, collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;
    // 4. Get results
    for (int i = 0; i < hits.length; ++i) {
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);
        System.out.println(d.get("title"));
    }
    reader.close();
} catch (Exception e) {
    System.out.println("Exception (LuceneAlgo.getSimilarString()) : " + e);
}

I see two solutions:
1. As suggested by Yahnoosh, index the title field twice: once as a TextField (analyzed) and once as a StringField (not analyzed).
2. Index it only as a TextField, but use a SpanFirstQuery when querying:
// 2. query
Term term = new Term("title", querystr);
// create the prefix query and wrap it so it can be used as a span query
PrefixQuery pq = new PrefixQuery(term);
SpanQuery wrapper = new SpanMultiTermQueryWrapper<PrefixQuery>(pq);
// match only when the prefix term is the first term of the field
Query firstPositionQuery = new SpanFirstQuery(wrapper, 1);

If I understand your scenario correctly, you want to autocomplete on the title field.
The solution is to have two fields: one analyzed, to enable querying over it, one non-analyzed to have titles indexed without breaking them into individual terms.
Your autocomplete logic should issue prefix queries against the non-analyzed field, so that a match has to start at the beginning of the title. Your term queries should be issued against the analyzed field for matches within the title.
I hope that makes sense.
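To make the difference between the suggested approaches concrete, here is a small plain-Java model. It is only a sketch and does not use Lucene: the whitespace tokenizer is a crude stand-in for a real analyzer, and PrefixSketch and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Plain-Java model of three prefix-matching strategies; this is not Lucene code.
final class PrefixSketch {
    // Crude stand-in for an analyzer: lowercase, then split on whitespace.
    static String[] tokens(String title) {
        return title.toLowerCase(Locale.ROOT).trim().split("\\s+");
    }

    // Models a prefix query on an analyzed field: any token may match.
    static boolean anyTokenStartsWith(String title, String prefix) {
        for (String token : tokens(title)) {
            if (token.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    // Models restricting the prefix match to the first token of the field.
    static boolean firstTokenStartsWith(String title, String prefix) {
        return tokens(title)[0].startsWith(prefix);
    }

    // Models a prefix query on a non-analyzed copy of the title:
    // the whole title is one term, and the comparison is case-sensitive.
    static boolean wholeTitleStartsWith(String title, String prefix) {
        return title.startsWith(prefix);
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
            "holiday mansion houseboat", "eye muscles", "movies of all time", "machine");
        List<String> suggestions = new ArrayList<>();
        for (String title : titles) {
            if (wholeTitleStartsWith(title, "m")) {
                suggestions.add(title);
            }
        }
        System.out.println(suggestions); // [movies of all time, machine]
    }
}
```

One difference worth noting: a prefix that spans more than one word, such as "movies o", can only match in the whole-title model, which is the usual argument for the extra non-analyzed field in autocomplete.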

Deleting a document in apache lucene having exact match

I want to delete a document in Apache Lucene only when it matches exactly. For example, I have documents containing the text:
Document1: Bilal
Document2: Bilal Ahmed
Document3: Bilal Ahmed - 54
When I try to remove the document with the query 'Bilal', it deletes all three documents, while it should delete only the first document, the exact match.
The code I use is this:
String query = "bilal";
String field = "userNames";
Term term = new Term(field, query);
IndexWriter indexWriter = null;
File indexDir = new File(idexedDirectory);
Directory directory = FSDirectory.open(indexDir);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46, analyzer);
indexWriter = new IndexWriter(directory, iwc);
indexWriter.deleteDocuments(term);
indexWriter.close();
This is how I am indexing my documents:
File indexDir = new File("C:\\Local DB\\TextFiled");
Directory directory = FSDirectory.open(indexDir);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46, analyzer);
// Third, create the index writer that will index the documents
indexWriter = new IndexWriter(directory, iwc);
int i = 0;
try (DataSource db = DataSource.getInstance()) {
    PreparedStatement ps = db.getPreparedStatement(
        "SELECT user_id, username FROM " + TABLE_NAME + " as au" + User_CONDITION);
    try (ResultSet resultSet = ps.executeQuery()) {
        while (resultSet.next()) {
            i++;
            doc = new Document();
            text = resultSet.getString("username");
            doc.add(new StringField("userNames", text, Field.Store.YES));
            indexWriter.addDocument(doc);
            System.out.println("User Name : " + text + " : " + userID);
        }
    }
}

You have not shown how you index those documents. If they are indexed with StandardAnalyzer and tokenization is on, these results are expected: StandardAnalyzer splits the text into one token per word, and since each of your documents contains Bilal, the term matches all of them.
The general advice is to always add a unique id field and to query or delete by that field.
If you can't do this, index the same text in a separate field without tokenization and use a phrase query to find the exact match, but this sounds like a horrible hack to me.
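The effect of tokenization on a delete-by-term can be modeled in plain Java. This is only a sketch and not Lucene code: the tokenizer below (lowercase, split on non-alphanumeric characters) is a rough approximation of what an analyzer does, and ExactDeleteSketch and its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Plain-Java model of deleting by term on a tokenized versus an
// untokenized field; this is not Lucene code.
final class ExactDeleteSketch {
    // Tokenized field: the value is lowercased and split into words,
    // and the term matches if it equals any one of those words.
    static boolean tokenizedMatch(String value, String term) {
        String[] words = value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
        return Arrays.asList(words).contains(term);
    }

    // Untokenized field: the whole value is a single term,
    // compared exactly and case-sensitively.
    static boolean exactMatch(String value, String term) {
        return value.equals(term);
    }

    static Map<Integer, String> deleteTokenized(Map<Integer, String> docs, String term) {
        Map<Integer, String> remaining = new LinkedHashMap<>(docs);
        remaining.values().removeIf(v -> tokenizedMatch(v, term));
        return remaining;
    }

    static Map<Integer, String> deleteExact(Map<Integer, String> docs, String term) {
        Map<Integer, String> remaining = new LinkedHashMap<>(docs);
        remaining.values().removeIf(v -> exactMatch(v, term));
        return remaining;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new LinkedHashMap<>();
        docs.put(1, "Bilal");
        docs.put(2, "Bilal Ahmed");
        docs.put(3, "Bilal Ahmed - 54");
        System.out.println(deleteTokenized(docs, "bilal")); // {}
        System.out.println(deleteExact(docs, "Bilal"));     // {2=Bilal Ahmed, 3=Bilal Ahmed - 54}
    }
}
```

In this model the exact comparison is case-sensitive, so deleting by "bilal" on the untokenized field would remove nothing. The same care about case applies when building the delete term for a non-analyzed field.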

search with in date range lucene and AND operator

I want to write a query that returns data within a date range and also satisfies one more AND condition, in Lucene 3.0.1. This is the code for the query between two dates:
IndexSearcher searcher = new IndexSearcher(directory);
String lowerDate = "2013-06-27";
String upperDate = "2013-06-29";
boolean includeLower = true;
boolean includeUpper = true;
TermRangeQuery query = new TermRangeQuery("created_at", lowerDate, upperDate, includeLower, includeUpper);
// display search results
TopDocs topDocs = searcher.search(query, 10);
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    Document doc = searcher.doc(scoreDoc.doc);
    System.out.println(doc.get("id"));
}
I have one more indexed column, text. How can I add an AND condition to this query? I am trying to get results within the date range that also contain some keyword in the text column.

You need to use a BooleanQuery, like:
TermRangeQuery dateQuery = new TermRangeQuery("created_at",lowerDate, upperDate, includeLower, includeUpper);
TermQuery keywordQuery = new TermQuery(new Term("keyword", "term"));
BooleanQuery bq = new BooleanQuery();
bq.add(new BooleanClause(dateQuery, BooleanClause.Occur.MUST));
bq.add(new BooleanClause(keywordQuery, BooleanClause.Occur.MUST));
// display search results
TopDocs topDocs = searcher.search(bq, 10);
Combining the two clauses, each with BooleanClause.Occur.MUST, is equivalent to an "AND" (take a look at the descriptions of the "MUST", "SHOULD" and "MUST_NOT" in the BooleanClause.Occur documentation to better understand your options with Lucene's "boolean" logic).
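A term range over date strings compares the terms as text, which only works because dates in the zero-padded yyyy-MM-dd form sort the same way as the dates they represent. The plain-Java sketch below models that comparison together with the AND condition. It is not Lucene code: DateRangeSketch and its methods are hypothetical, and the keyword check is a simple substring test standing in for a term query.

```java
// Plain-Java model of a string range check combined with a keyword
// condition; this is not Lucene code.
final class DateRangeSketch {
    // Compares the value to the bounds as text, the way a term range does.
    static boolean inRange(String value, String lower, String upper,
                           boolean includeLower, boolean includeUpper) {
        int vsLower = value.compareTo(lower);
        int vsUpper = value.compareTo(upper);
        boolean aboveLower = includeLower ? vsLower >= 0 : vsLower > 0;
        boolean belowUpper = includeUpper ? vsUpper <= 0 : vsUpper < 0;
        return aboveLower && belowUpper;
    }

    // Both conditions must hold, like two MUST clauses.
    static boolean matches(String createdAt, String text, String keyword,
                           String lower, String upper) {
        return inRange(createdAt, lower, upper, true, true) && text.contains(keyword);
    }

    public static void main(String[] args) {
        String lower = "2013-06-27";
        String upper = "2013-06-29";
        System.out.println(matches("2013-06-28", "lucene range query", "range", lower, upper)); // true
        System.out.println(matches("2013-06-30", "lucene range query", "range", lower, upper)); // false
        // A date that is not zero-padded sorts incorrectly as text.
        System.out.println(inRange("2013-6-28", lower, upper, true, true)); // false
    }
}
```

The last line of main shows why the dates need a fixed, zero-padded format when they are indexed as plain strings.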

How to get distinct value from Lucene Field

I am trying to search a Lucene index that I created using StandardAnalyzer.
The data to be indexed looks like the following:
course
BCA
MCA
BCA
BCA
MCA
When I search on course = "BCA", it returns BCA three times, but I want distinct values, i.e. BCA only once.
I am using the following code:
try {
    File indexDir = new File("D:\\indexdirectory\\");
    Directory directory = FSDirectory.open(indexDir);
    IndexSearcher searcher = new IndexSearcher(directory, true);
    QueryParser parser1 = new QueryParser(Version.LUCENE_36, "course", new StandardAnalyzer(Version.LUCENE_36));
    Query query = parser1.parse("BCA");
    int maxhits = 5000;
    TopDocs topDocs = searcher.search(query, maxhits);
    ScoreDoc[] hits = topDocs.scoreDocs;
    int len = hits.length;
    int docId;
    Document d;
    for (int j = 0; j < len; j++) {
        docId = hits[j].doc;
        d = searcher.doc(docId);
        String c = d.get("course");
        System.out.println("Course = " + c);
    }
} catch (Exception e) {
    System.out.println("Exception occurred: " + e);
}
It returns BCA three times instead of only once as expected.

search from apache lucene index and count the result group wise

I am trying to search a Lucene index, but I want to filter the search. There are two fields, contents and category. Suppose I want to search for files that contain "sports", and I also want to count how many of those files are in category a and how many are in category b. I am trying to achieve this with the following code. The problem is that with millions of records it becomes slow because of the loop. Please suggest another way to achieve the task.
try {
    File indexDir = new File("path of the file");
    Directory directory = FSDirectory.open(indexDir);
    IndexSearcher searcher = new IndexSearcher(directory, true);
    int maxhits = 1000000;
    QueryParser parser1 = new QueryParser(Version.LUCENE_36, "contents",
        new StandardAnalyzer(Version.LUCENE_36));
    Query qu = parser1.parse("sport");
    TopDocs topDocs = searcher.search(qu, maxhits);
    ScoreDoc[] hits = topDocs.scoreDocs;
    len = hits.length;
    JOptionPane.showMessageDialog(null, "found times" + len);
    int docId = 0;
    Document d;
    String category = "";
    int ctr = 0, ctr1 = 0;
    for (i = 0; i < len; i++) {
        docId = hits[i].doc;
        d = searcher.doc(docId);
        category = d.get("category");
        if (category.equals("a"))
            ctr++;
        if (category.equals("b"))
            ctr1++;
    }
    JOptionPane.showMessageDialog(null, "word found in category a times" + ctr);
    JOptionPane.showMessageDialog(null, "word found in category b times" + ctr1);
} catch (Exception ex) {
    ex.printStackTrace();
}

You could just query for each category you are looking for and get totalHits. Better still would be to use a TotalHitCountCollector, instead of getting a TopDocs instance:
Query query = parser1.parse("+sport +category:a");
TotalHitCountCollector collector = new TotalHitCountCollector();
searcher.search(query, collector);
ctr = collector.getTotalHits();
query = parser1.parse("+sport +category:b");
collector = new TotalHitCountCollector();
searcher.search(query, collector);
ctr1 = collector.getTotalHits();
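If there are more than two categories, the query strings can be generated instead of written out by hand. The helper below is a plain-Java sketch that only builds the strings and does not call Lucene. CategoryQuerySketch and its methods are hypothetical names, and it assumes that the keyword and category values contain no query-syntax characters that would need escaping.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Builds one query string per category in the "+keyword +category:x" form
// used in the answer above; this code does not call Lucene.
final class CategoryQuerySketch {
    // Both clauses are prefixed with '+', so both are required.
    static String countQuery(String keyword, String category) {
        return "+" + keyword + " +category:" + category;
    }

    // Maps each category to the query string that counts its hits.
    static Map<String, String> countQueries(String keyword, List<String> categories) {
        Map<String, String> queries = new LinkedHashMap<>();
        for (String category : categories) {
            queries.put(category, countQuery(keyword, category));
        }
        return queries;
    }

    public static void main(String[] args) {
        System.out.println(countQueries("sport", Arrays.asList("a", "b")));
        // {a=+sport +category:a, b=+sport +category:b}
    }
}
```

Each generated string would then be parsed and counted with its own TotalHitCountCollector, exactly as in the two hand-written queries.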
