search from apache lucene index and count the result group wise - java

I am trying to search from lucene index but i want to filter this search . there are two fields contents and and category .suppose i want to search in files which have "sports" and i also want to count to count how much files are in a and b category . I am trying to achive this with following code . But problem is that if there are millions of the records then it goes slow due to loop execution, suggest me another way to achieve the task.
try { File indexDir= new File("path of the file")
Directory directory = FSDirectory.open(indexDir);
IndexSearcher searcher = new IndexSearcher(directory, true);
int maxhits=1000000;
QueryParser parser1 = new QueryParser(Version.LUCENE_36, "contents",
new StandardAnalyzer(Version.LUCENE_36));
Query qu=parser1.parse("sport");
TopDocs topDocs = searcher.search(, maxhits);
ScoreDoc[] hits = topDocs.scoreDocs;
len = hits.length;
JOptionPane.showMessageDialog(null,"found times"+len);
int docId = 0;
Document d;
String category="";
int ctr=0,ctr1=0;
for ( i = 0; i<len; i++) {
docId = hits[i].doc;
d = searcher.doc(docId);
category= d.get(("category"));
if(category.equals("a"))
ctr++;
if(category.equals("b"))
ctr1++;
}
JOptionPane.showMessageDialog("wprd found in category a times"+ctr);
JOptionPane.showMessageDialog("wprd found in category b times"+ctr1);
}
catch(Exception ex)
{
ex.printStackTrace();
}

You could just query for each category you are looking for and get totalHits. Better still would be to use a TotalHitCountCollector, instead of getting a TopDocs instance:
Query query = parser1.parser("+sport +category:a")
TotalHitCountCollector collector = new TotalHitCountCollector();
search.search(query, collector);
ctr = collector.getTotalHits();
query = parser1.parser("+sport +category:b")
collector = new TotalHitCountCollector();
search.search(query, collector);
ctr1 = collector.getTotalHits();

Related

Prefix search using lucene

I am trying to do autocomplete using lucene search functionality. I have the following code which searches by the query prefix but along with that it also gives me all the sentences containing that word while I want it to display only sentence or word starting exactly with that prefix.
ex: m
--holiday mansion houseboat
--eye muscles
--movies of all time
--machine
I want it to show only last 2 queries. How to do it am stucked here also I am new to lucene. Please can any one help me in this. Thanks in advance.
addDoc(IndexWriter w, String title, String isbn) throws IOException {
Document doc = new Document();
doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
// use a string field for isbn because we don't want it tokenized
doc.add(new Field("isbn", isbn, Field.Store.YES, Field.Index.ANALYZED));
w.addDocument(doc);
}
Main:
try {
// 0. Specify the analyzer for tokenizing text.
// The same analyzer should be used for indexing and searching
StandardAnalyzer analyzer = new StandardAnalyzer();
// 1. create the index
Directory index = FSDirectory.open(new File(indexDir));
IndexWriter writer = new IndexWriter(index, new StandardAnalyzer(Version.LUCENE_30), true, IndexWriter.MaxFieldLength.UNLIMITED); //3
for (int i = 0; i < source.size(); i++) {
addDoc(writer, source.get(i), + (i + 1) + "z");
}
writer.close();
// 2. query
Term term = new Term("title", querystr);
//create the term query object
PrefixQuery query = new PrefixQuery(term);
// 3. search
int hitsPerPage = 20;
IndexReader reader = IndexReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// 4. Get results
for (int i = 0; i < hits.length; ++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println(d.get("title"));
}
reader.close();
} catch (Exception e) {
System.out.println("Exception (LuceneAlgo.getSimilarString()) : " + e);
}
}
}
I see two solutions:
as suggested by Yahnoosh, save the title field twice, Once as TextField (=analyzed) and once as StringField (not analyzed)
save it just as TextField, but When Querying use SpanFirstQuery
// 2. query
Term term = new Term("title", querystr);
//create the term query object
PrefixQuery pq = new PrefixQuery(term);
SpanQuery wrapper = new SpanMultiTermQueryWrapper<PrefixQuery>(pq);
Query final = new SpanFirstQuery(wrapper, 1);
If I understand your scenario correctly, you want to autocomplete on the title field.
The solution is to have two fields: one analyzed, to enable querying over it, one non-analyzed to have titles indexed without breaking them into individual terms.
Your autocomplete logic should issue prefix queries against the non-analyzed field to match only on the first word. Your term queries should be issued against the analyzed field for matches within the title.
I hope that makes sense.

Fetch Searched Data/Metadata In Lucene

Hi I am java developer and learning Lucene. I have a java class that index a pdf(lucene_in_action_2nd_edition.pdf) file and a search class that search some text from index. IndexSearcher is giving Document which shows that string exists in index(lucene_in_action_2nd_edition.pdf) or not.
But now I want to get searched data or metadata. i.e. I want to know that at which page string is matched, or few text around matched string, etc... How to do that?
Here is my LuceneSearcher.java class:
public static void main(String[] args) throws Exception {
File indexDir = new File("D:\\index");
String querystr = "Advantages of FastVectorHighlighter";
Query q = new QueryParser(Version.LUCENE_40, "contents",
new StandardAnalyzer(Version.LUCENE_40)).parse(querystr);
int hitsPerPage = 100;
IndexReader reader = DirectoryReader.open(FSDirectory.open(indexDir));
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(
hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
System.out.println("Found " + hits.length + " hits.");
for (int i = 0; i < hits.length; i++) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println((i + 1) + "... " + d.get("filename"));
System.out.println("=====================================================");
System.out.println(d.get("contents"));
}
// reader can only be closed when there
// is no need to access the documents any more.
reader.close();
}
Here d.get("contents") give full text(generated by Tika) of .pdf file, that was stored at time of indexing.
I want some information about searched text, so that I can show that on my web page or highlight searched text properly(like google search output). How to achieve that? Do we need to write some logic or Lucene does it internally?
Any type of help would be appreciated. Thanks in advance.
The org.apache.lucene.search.highlight package provides this functionality.
Such as:
SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter();
Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));
for (int i = 0; i < hits.length; i++) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
String text = doc.get("contents");
String bestFrag = highlighter.getBestFragment(analyzer, "contents", text);
//output, however you like.
You can also get a list of best Fragments from the highlighter, instead of just a single one, if you prefer, see the Highlighter API

How to get distinct value from Lucene Field

I am trying to make a search from Lucene index. I have created an index using StandardAnalyzer
I have the data to be indexed like following
course
BCA
MCA
BCA
BCA
MCA
When i search on course ="BCA" it returns me 3 times BCA but i want it should give the distinct values ie only once
I am using the following code
try {
File indexDir = new File("D:\\indexdirectory\\");
Directory directory = FSDirectory.open(indexDir);
IndexSearcher searcher = new IndexSearcher(directory, true);
QueryParser parser1;
parser1= new QueryParser(Version.LUCENE_36, "course", new StandardAnalyzer(Version.LUCENE_36));
Query query = parser1.parse("BCA");
int maxhits = 5000;
TopDocs topDocs = searcher.search(query, maxhits);
ScoreDoc[] hits = topDocs.scoreDocs;
int len = hits.length;
int docId;
Document d;
for(int j=0;j<len;j++) {
docId = hits[j].doc;
d = searcher.doc(docId);
String c = d.get("course");
System.out.println("Course = "+c);
}
}catch(Exception e) {
System.out.println("Exception occured"+e);
}
it returns BCA 3 times not only once as expected.

how to refine the search using apache lucene index

I am searching a keyword using index created by apache lucene , it returns the name of files which contains the given keyword now i want to refine the search again only in the files returned by lucene search . How is it possible to refine the search using apache lucene.
I am using the following code.
try
{
File indexDir=new File("path upto the index directory");
Directory directory = FSDirectory.open(indexDir);
IndexSearcher searcher = new IndexSearcher(directory, true);
QueryParser parser = new QueryParser(Version.LUCENE_36, "contents", new SimpleAnalyzer(Version.LUCENE_36));
Query query = parser.parse(qu);
query.setBoost((float) 1.5);
TopDocs topDocs = searcher.search(query, maxhits);
ScoreDoc[] hits = topDocs.scoreDocs;
len = hits.length;
int docId = 0;
Document d;
for ( i = 0; i<len; i++) {
docId = hits[i].doc;
d = searcher.doc(docId);
filename= d.get(("filename"));
}
}
catch(Exception ex){ex.printStackTrace();}
I have added documents in the lucene index using as contents and filename.
You want to use a BooleanQuery for something like this. That will let you AND the original search constraints with the refined search constraints.
Example:
BooleanQuery query = new BooleanQuery();
Query origSearch = getOrigSearch(searchString);
Query refinement = makeRefinement();
query.add(origSearch, Occur.MUST);
query.add(refinement, Occur.MUST);
TopDocs topDocs = searcher.search(query, maxHits);

Missing hits on lucene index search

i index one big database overview (just text fields) on which the user must be able to search (below in indexFields method). This search before was done in the database with ILIKE query, but was slow, so now search is done on index. Hovewer, when i compare search results from db query, and results i get with the index search, there is always much less results with search from index.
Im not sure if i am making mistake in indexing or in search process. To me all seems to make sense here. Any ideas?
Here is the code. All advices appreciated!
// INDEXING
StandardAnalyzer analyzer = new StandardAnalyzer(
Version.LUCENE_CURRENT, stopSet); // stop set is empty
IndexWriter writer = new IndexWriter(INDEX_DIR, analyzer, true,
IndexWriter.MaxFieldLength.UNLIMITED);
indexFields(writer);
writer.optimize();
writer.commit();
writer.close();
analyzer.close();
private void indexFields(IndexWriter writer) {
DetachedCriteria criteria = DetachedCriteria
.forClass(Activit.class);
int count = 0;
int max = 50000;
boolean existMoreToIndex = true;
List<Activit> result = new ArrayList<Activit>();
while (existMoreToIndex) {
try {
result = activitService.listPaged(count, max);
if (result.size() < max)
existMoreToIndex = false;
if (result.size() == 0)
return;
for (Activit ao : result) {
Document doc = new Document();
doc.add(new Field("id", String.valueOf(ao.getId()),
Field.Store.YES, Field.Index.ANALYZED));
if(ao.getActivitOwner()!=null)
doc.add(new Field("field1", ao.getActivityOwner(),Field.Store.YES, Field.Index.ANALYZED));
if(ao.getActivitResponsible() != null)
doc.add(new Field("field2", ao.getActivityResponsible(), Field.Store.YES,Field.Index.ANALYZED));
try {
writer.addDocument(doc);
} catch (CorruptIndexException e) {
e.printStackTrace();
}
count += max;
//SEARCH
public List<Activit> searchActivitiesInIndex(String searchCriteria) {
Set<String> stopSet = new HashSet<String>(); // empty because we do not want to remove stop words
Version version = Version.LUCENE_CURRENT;
String[] fields = {
"field1", "field2"};
try {
File tempFile = new File("C://testindex");
Directory INDEX_DIR = new SimpleFSDirectory(tempFile);
Searcher searcher = new IndexSearcher(INDEX_DIR, true);
QueryParser parser = new MultiFieldQueryParser(version, fields, new StandardAnalyzer(
version, stopSet));
Query query = parser.parse(searchCriteria);
TopDocs topDocs = searcher.search(query, 500);
ScoreDoc[] hits = topDocs.scoreDocs;
//here i always get smaller hits lenght
searcher.close();
} catch (Exception e) {
e.printStackTrace();
}
}
Most likely the analyzer is doing something that you aren't expecting.
Open your index using Luke, you can see what your (analyzed) indexed documents look like, as well as your parsed queries - should let you see what's going wrong.
Also, can you give an example of searchCriteria? And the corresponding SQL query? Without that, it's hard to know if the indexing is done correctly. You may also not need to use MultiFieldQueryParser, which is quite inefficient.

Categories