Lucene searching and ranking with TopDocs - Java

I have created an index with Lucene using some supplied Java code and successfully indexed 9 XML files. I now have to modify another supplied Java file to search the index. I can output the number of hits, but I need to modify the output further so that submitting a query prints the top ten results in the following format:
Ranking: 0. Filename: .xml FilePath: c:/folder/movie.xml Score: 0.5
I'm trying to write a for loop, but none of the examples I have tried seem to work. This is my first venture into both Java and Lucene, so any help would be greatly appreciated.
public class LuceneSearch {
public int n = 0;
//String fileName;
/*searchIndex is the method involved with initiating Searching the Index
via the standardanalyzer by iterating through and using Hits for the results*/
public static void searchIndex(String searchString) throws IOException, ParseException {
String fieldContents = "summary";//current field name to search for. Each text item field name= 'contents'
String fileName = "filename";
String filePath = "filepath";
Directory directory = FSDirectory.getDirectory("/Users/Jac/Documents/index/");
//get index location
//initiate reader and searcher classes
IndexReader indexReader = IndexReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
//initiate standardanalyzer
Analyzer analyzer = new StandardAnalyzer();
//parse the query contents field with queryparser
QueryParser queryParser = new QueryParser(fieldContents, analyzer);
//get user query string
Query query = queryParser.parse(searchString.toLowerCase());
//Initiate HITS class and utilise methods
TopDocs hits = indexSearcher.search(query,null,10);
System.out.println("Searching for '" + searchString.toLowerCase() + "'");
//Hits hits = indexSearcher.search(query);
System.out.println("Number of hits: " + hits.totalHits);
System.out.println("Searching XML Tag Element '" + searchString.toLowerCase() + "'");
System.out.println("Number of hits: " + hits.totalHits);
for(ScoreDoc scoreDoc : hits.scoreDocs) {
// Document doc = IndexSearcher.doc(scoreDoc.doc);
System.out.println("Ranking: ");
// System.out.println(doc.get("fullpath"));
}
System.out.println("***Search Complete***");
}
public static void main(String[] args) throws Exception {
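One way to produce the requested output line, as a minimal sketch. It assumes the loop over hits.scoreDocs fetches each document through the searcher instance (indexSearcher.doc(scoreDoc.doc), not the static-style IndexSearcher.doc of the commented-out line) and that the filename and filepath fields were stored at index time. RankingFormatter and its methods are hypothetical helpers, written in plain Java so the formatting can be tried without an index:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene.
final class RankingFormatter {

    // Builds one output row. In the Lucene loop, 'score' would be scoreDoc.score
    // and the two strings would come from the document's stored fields.
    static String formatHit(int rank, String fileName, String filePath, float score) {
        return "Ranking: " + rank
                + ". Filename: " + fileName
                + " FilePath: " + filePath
                + " Score: " + score;
    }

    // Formats at most 'limit' rows from parallel arrays that are assumed to be
    // sorted by descending score already (as TopDocs results are by default).
    static List<String> formatTop(String[] fileNames, String[] filePaths,
                                  float[] scores, int limit) {
        int n = Math.min(limit, scores.length);
        List<String> rows = new ArrayList<>();
        for (int rank = 0; rank < n; rank++) {
            rows.add(formatHit(rank, fileNames[rank], filePaths[rank], scores[rank]));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for real search results.
        String[] names = {"movie.xml", "actor.xml"};
        String[] paths = {"c:/folder/movie.xml", "c:/folder/actor.xml"};
        float[] scores = {0.5f, 0.25f};
        for (String row : formatTop(names, paths, scores, 10)) {
            System.out.println(row);
        }
    }
}
```

Inside searchIndex, the rank would be a counter incremented once per ScoreDoc, and the strings would be read with doc.get("filename") and doc.get("filepath").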

Related

Error in Lucene text Search

I'm new to text search and I'm studying some Lucene examples. I found one of them at this link: http://javatechniques.com/blog/lucene-in-memory-text-search-example/ I tried it in Eclipse, but it gives some errors, even though I imported all the relevant jar files.
Here is the code:
public class InMemoryExample {
public static void main(String[] args) {
// Construct a RAMDirectory to hold the in-memory representation
// of the index.
RAMDirectory idx = new RAMDirectory();
try {
// Make an writer to create the index
IndexWriter writer =
new IndexWriter(idx, new StandardAnalyzer(Version.LUCENE_48),
IndexWriter.MaxFieldLength.LIMITED);
// Add some Document objects containing quotes
writer.addDocument(createDocument("Theodore Roosevelt",
"It behooves every man to remember that the work of the " +
"critic, is of altogether secondary importance, and that, " +
"in the end, progress is accomplished by the man who does " +
"things."));
writer.addDocument(createDocument("Friedrich Hayek",
"The case for individual freedom rests largely on the " +
"recognition of the inevitable and universal ignorance " +
"of all of us concerning a great many of the factors on " +
"which the achievements of our ends and welfare depend."));
writer.addDocument(createDocument("Ayn Rand",
"There is nothing to take a man's freedom away from " +
"him, save other men. To be free, a man must be free " +
"of his brothers."));
writer.addDocument(createDocument("Mohandas Gandhi",
"Freedom is not worth having if it does not connote " +
"freedom to err."));
// Optimize and close the writer to finish building the index
writer.optimize();
writer.close();
// Build an IndexSearcher using the in-memory index
Searcher searcher = new IndexSearcher(idx);
// Run some queries
search(searcher, "freedom");
search(searcher, "free");
search(searcher, "progress or achievements");
searcher.close();
}
catch (IOException ioe) {
// In this example we aren't really doing an I/O, so this
// exception should never actually be thrown.
ioe.printStackTrace();
}
catch (ParseException pe) {
pe.printStackTrace();
}
}
/**
* Make a Document object with an un-indexed title field and an
* indexed content field.
*/
private static Document createDocument(String title, String content) {
Document doc = new Document();
// Add the title as an unindexed field...
doc.add(new Field("title", title, Field.Store.YES, Field.Index.NO));
// ...and the content as an indexed field. Note that indexed
// Text fields are constructed using a Reader. Lucene can read
// and index very large chunks of text, without storing the
// entire content verbatim in the index. In this example we
// can just wrap the content string in a StringReader.
doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED));
return doc;
}
/**
* Searches for the given string in the "content" field
*/
private static void search(Searcher searcher, String queryString)
throws ParseException, IOException {
// Build a Query object
//Query query = QueryParser.parse(
QueryParser parser = new QueryParser("content", new StandardAnalyzer(Version.LUCENE_48));
Query query = parser.parse(queryString);
int hitsPerPage = 10;
// Search for the query
TopScoreDocCollector collector = TopScoreDocCollector.create(5 * hitsPerPage, false);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
int hitCount = collector.getTotalHits();
System.out.println(hitCount + " total matching documents");
// Examine the Hits object to see if there were any matches
if (hitCount == 0) {
System.out.println(
"No matches were found for \"" + queryString + "\"");
} else {
System.out.println("Hits for \"" +
queryString + "\" were found in quotes by:");
// Iterate over the Documents in the Hits object
for (int i = 0; i < hitCount; i++) {
// Document doc = hits.doc(i);
ScoreDoc scoreDoc = hits[i];
int docId = scoreDoc.doc;
float docScore = scoreDoc.score;
System.out.println("docId: " + docId + "\t" + "docScore: " + docScore);
Document doc = searcher.doc(docId);
// Print the value that we stored in the "title" field. Note
// that this Field was not indexed, but (unlike the
// "contents" field) was stored verbatim and can be
// retrieved.
System.out.println(" " + (i + 1) + ". " + doc.get("title"));
System.out.println("Content: " + doc.get("content"));
}
}
System.out.println();
} }
However, it shows a few compile errors on the following lines:
Error 1: MaxFieldLength is underlined in red
IndexWriter writer =
new IndexWriter(idx, new StandardAnalyzer(Version.LUCENE_48),
IndexWriter.MaxFieldLength.LIMITED);
Error 2: optimize() is underlined in red
writer.optimize();
Error 3: new IndexSearcher(idx) is underlined in red
Searcher searcher = new IndexSearcher(idx);
Error 4: search is underlined in red
searcher.search(query, collector);
Could you please help me get rid of these errors? It would be a great help. Thanks.
Modified code:
public class InMemoryExample {
public static void main(String[] args) throws Exception{
// Construct a RAMDirectory to hold the in-memory representation
// of the index.
RAMDirectory idx = new RAMDirectory();
// Make an writer to create the index
IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_48, new
StandardAnalyzer(Version.LUCENE_48));
IndexWriter writer = new IndexWriter(idx, cfg);
// Add some Document objects containing quotes
writer.addDocument(createDocument("Theodore Roosevelt",
"It behooves every man to remember that the work of the " +
"critic, is of altogether secondary importance, and that, " +
"in the end, progress is accomplished by the man who does " +
"things."));
writer.addDocument(createDocument("Friedrich Hayek",
"The case for individual freedom rests largely on the " +
"recognition of the inevitable and universal ignorance " +
"of all of us concerning a great many of the factors on " +
"which the achievements of our ends and welfare depend."));
writer.addDocument(createDocument("Ayn Rand",
"There is nothing to take a man's freedom away from " +
"him, save other men. To be free, a man must be free " +
"of his brothers."));
writer.addDocument(createDocument("Mohandas Gandhi",
"Freedom is not worth having if it does not connote " +
"freedom to err."));
// Optimize and close the writer to finish building the index
writer.commit();
writer.close();
// Build an IndexSearcher using the in-memory index
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(idx));
// Run some queries
search(searcher, "freedom");
search(searcher, "free");
search(searcher, "progress or achievements");
//searcher.close();
}
/**
* Make a Document object with an un-indexed title field and an
* indexed content field.
*/
private static Document createDocument(String title, String content) {
Document doc = new Document();
// Add the title as an unindexed field...
doc.add(new Field("title", title, Field.Store.YES, Field.Index.NO));
// ...and the content as an indexed field. Note that indexed
// Text fields are constructed using a Reader. Lucene can read
// and index very large chunks of text, without storing the
// entire content verbatim in the index. In this example we
// can just wrap the content string in a StringReader.
doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED));
return doc;
}
/**
* Searches for the given string in the "content" field
*/
private static void search(IndexSearcher searcher, String queryString)
throws ParseException, IOException {
// Build a Query object
//Query query = QueryParser.parse(
QueryParser parser = new QueryParser("content", new StandardAnalyzer(Version.LUCENE_48));
Query query = parser.parse(queryString);
int hitsPerPage = 10;
// Search for the query
TopScoreDocCollector collector = TopScoreDocCollector.create(5 * hitsPerPage, false);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
int hitCount = collector.getTotalHits();
System.out.println(hitCount + " total matching documents");
// Examine the Hits object to see if there were any matches
if (hitCount == 0) {
System.out.println(
"No matches were found for \"" + queryString + "\"");
} else {
System.out.println("Hits for \"" +
queryString + "\" were found in quotes by:");
// Iterate over the Documents in the Hits object
for (int i = 0; i < hitCount; i++) {
// Document doc = hits.doc(i);
ScoreDoc scoreDoc = hits[i];
int docId = scoreDoc.doc;
float docScore = scoreDoc.score;
System.out.println("docId: " + docId + "\t" + "docScore: " + docScore);
Document doc = searcher.doc(docId);
// Print the value that we stored in the "title" field. Note
// that this Field was not indexed, but (unlike the
// "contents" field) was stored verbatim and can be
// retrieved.
System.out.println(" " + (i + 1) + ". " + doc.get("title"));
System.out.println("Content: " + doc.get("content"));
}
}
System.out.println();
} }
and this is the output:
Exception in thread "main" java.lang.VerifyError: class org.apache.lucene.analysis.SimpleAnalyzer overrides final method tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(Unknown Source)
    at java.security.SecureClassLoader.defineClass(Unknown Source)
    at java.net.URLClassLoader.defineClass(Unknown Source)
    at java.net.URLClassLoader.access$100(Unknown Source)
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at beehex.inmemeory.textsearch.InMemoryExample.search(InMemoryExample.java:98)
    at beehex.inmemeory.textsearch.InMemoryExample.main(InMemoryExample.java:58)
The IndexWriter constructor no longer takes a third argument. Update the code to the newer Lucene API like so:
IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_48, new StandardAnalyzer(Version.LUCENE_48));
IndexWriter writer = new IndexWriter(idx, cfg);
Also, rather than catching the exceptions here, I would declare main as throws Exception and let the program fail outright.
EDIT:
2) Remove the optimize() call, since IndexWriter no longer has that method (I think commit() will do the trick here).
3) Create the IndexSearcher like so:
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(idx));
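The VerifyError in the question's output is typically a sign that jars from different Lucene versions are mixed on the classpath: the class org.apache.lucene.analysis.SimpleAnalyzer is being loaded from an older jar and overrides tokenStream, which the newer Analyzer declares final. The following plain-Java diagnostic is only a sketch for checking that assumption; ClassOrigin is a hypothetical helper name, and the class names listed in main are examples taken from the code and stack trace in the question:

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper: reports where a class was loaded from.
final class ClassOrigin {

    // Returns the jar or directory a class comes from, or a short marker
    // when the class cannot be loaded or has no recorded code source.
    static String locate(String className) {
        try {
            // initialize=false: load the class without running static initializers
            Class<?> c = Class.forName(className, false, ClassOrigin.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            if (src == null || src.getLocation() == null) {
                return "(bootstrap or unknown)";
            }
            return src.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            // VerifyError is a LinkageError, so a conflicting class is reported here
            return "(not loadable: " + e.getClass().getSimpleName() + ")";
        }
    }

    public static void main(String[] args) {
        String[] names = args.length > 0 ? args : new String[] {
            "org.apache.lucene.analysis.Analyzer",
            "org.apache.lucene.analysis.SimpleAnalyzer",
            "org.apache.lucene.index.IndexWriter"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

If the printed locations point at jars with different version numbers, removing the older jar from the Eclipse build path is the first thing to try.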

Fetch Searched Data/Metadata In Lucene

Hi, I am a Java developer learning Lucene. I have a Java class that indexes a PDF file (lucene_in_action_2nd_edition.pdf) and a search class that searches the index for some text. IndexSearcher returns a Document, which tells me whether the string exists in the indexed file (lucene_in_action_2nd_edition.pdf) or not.
But now I want to get the matched data or metadata, i.e. I want to know on which page the string matched, or a few words of text around the match, etc. How can I do that?
Here is my LuceneSearcher.java class:
public static void main(String[] args) throws Exception {
File indexDir = new File("D:\\index");
String querystr = "Advantages of FastVectorHighlighter";
Query q = new QueryParser(Version.LUCENE_40, "contents",
new StandardAnalyzer(Version.LUCENE_40)).parse(querystr);
int hitsPerPage = 100;
IndexReader reader = DirectoryReader.open(FSDirectory.open(indexDir));
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(
hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
System.out.println("Found " + hits.length + " hits.");
for (int i = 0; i < hits.length; i++) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println((i + 1) + "... " + d.get("filename"));
System.out.println("=====================================================");
System.out.println(d.get("contents"));
}
// reader can only be closed when there
// is no need to access the documents any more.
reader.close();
}
Here d.get("contents") gives the full text of the .pdf file (extracted by Tika), which was stored at indexing time.
I want some information about the matched text so that I can show it on my web page or highlight it properly (like Google search output). How can I achieve that? Do we need to write some logic ourselves, or does Lucene do it internally?
Any help would be appreciated. Thanks in advance.
The org.apache.lucene.search.highlight package provides this functionality.
For example:
SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter();
// query is the parsed Query (q in the question's code)
Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));
for (int i = 0; i < hits.length; i++) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
String text = d.get("contents");
// analyzer should be the same analyzer that was used for indexing
String bestFrag = highlighter.getBestFragment(analyzer, "contents", text);
// output, however you like.
}
You can also get a list of the best fragments from the highlighter instead of just a single one, if you prefer; see the Highlighter API.
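To make the idea of a "best fragment" concrete, here is a deliberately simplified plain-Java illustration. It is not how Lucene's Highlighter is implemented (the real one works on analyzed tokens and scores fragments); it only shows the kind of result to expect: a window of text around a match, with the match wrapped in tags. SnippetSketch, bestFragment and the radius parameter are hypothetical names; the <B> tags mirror what SimpleHTMLFormatter emits by default.

```java
// Simplified illustration only; Lucene's Highlighter is token-based and scored.
final class SnippetSketch {

    // Returns up to 'radius' characters of context on each side of the first
    // case-insensitive occurrence of 'term', with the match wrapped in <B> tags.
    // Returns null when the term does not occur in the text.
    static String bestFragment(String text, String term, int radius) {
        for (int at = 0; at + term.length() <= text.length(); at++) {
            if (text.regionMatches(true, at, term, 0, term.length())) {
                int end = at + term.length();
                int from = Math.max(0, at - radius);
                int to = Math.min(text.length(), end + radius);
                return text.substring(from, at)
                        + "<B>" + text.substring(at, end) + "</B>"
                        + text.substring(end, to);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String text = "Advantages of FastVectorHighlighter over Highlighter";
        System.out.println(bestFragment(text, "fastvectorhighlighter", 15));
    }
}
```

As in this sketch, it is worth checking the fragment for null before printing it, since a stored document may contain no highlightable match for the field.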

How will I go about indexing a customer using Lucene

I have a web application which stores customers' usernames, emails and phone numbers.
I want customers to be able to search for other users by email, phone or username, as a start, just to understand the whole Lucene concept. Later on I will add functionality to search within the items a user posts. I am following this example: www.lucenetutorial.com/lucene-in-5-minutes.html
public class HelloLucene {
public static void main(String[] args) throws IOException, ParseException {
// 0. Specify the analyzer for tokenizing text.
// The same analyzer should be used for indexing and searching
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
// 1. create the index
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
IndexWriter w = new IndexWriter(index, config);
addDoc(w, "Lucene in Action", "193398817");
addDoc(w, "Lucene for Dummies", "55320055Z");
addDoc(w, "Managing Gigabytes", "55063554A");
addDoc(w, "The Art of Computer Science", "9900333X");
w.close();
// 2. query
String querystr = args.length > 0 ? args[0] : "lucene";
// the "title" arg specifies the default field to use
// when no field is explicitly specified in the query.
Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr);
// 3. search
int hitsPerPage = 10;
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// 4. display results
System.out.println("Found " + hits.length + " hits.");
for(int i=0;i<hits.length;++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println((i + 1) + ". " + d.get("isbn") + "\t" + d.get("title"));
}
// reader can only be closed when there
// is no need to access the documents any more.
reader.close();
}
private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
Document doc = new Document();
doc.add(new TextField("title", title, Field.Store.YES));
// use a string field for isbn because we don't want it tokenized
doc.add(new StringField("isbn", isbn, Field.Store.YES));
w.addDocument(doc);
}
}
I want new customers to be added to the index automatically on registration. The customerId is a timestamp. Should I add a new document for each field of the customer's details, or should I concatenate all fields into a string and add it as a single document? Please go easy on me, I am really new.
This is a good place to start with Lucene's indexing mechanism:
http://www.ibm.com/developerworks/library/wa-lucene/
The bottom line: when Lucene indexes a document, it first converts it into a Lucene Document. A Document consists of a set of fields, and each field is a set of terms. A term is nothing but a stream of bytes.
The document to be indexed is passed to an analyzer, which forms these terms from it; these terms are the keywords that are matched during searching.
When we perform a search, the query is analyzed with the same analyzer and then matched against the terms.
So you don't have to create a document for each field; rather, you should create a single document for each user.
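The "one document per user, one field per attribute" model can be sketched without Lucene at all. The class below is a toy stand-in, not Lucene code: each customer is a single map of field names to values, and a search targets one field. In Lucene the map would be a Document and each entry a field such as the StringField used for isbn in HelloLucene. CustomerIndexSketch, its methods and the field names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of "one document per customer"; not a replacement for a Lucene index.
final class CustomerIndexSketch {

    // Each customer is ONE document: a map from field name to field value.
    private final List<Map<String, String>> docs = new ArrayList<>();

    void addCustomer(String customerId, String username, String email, String phone) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("customerId", customerId);
        doc.put("username", username);
        doc.put("email", email);
        doc.put("phone", phone);
        docs.add(doc);
    }

    // Exact, case-insensitive match on a single field; returns matching customer ids.
    List<String> search(String field, String value) {
        List<String> ids = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            String stored = doc.get(field);
            if (stored != null && stored.equalsIgnoreCase(value)) {
                ids.add(doc.get("customerId"));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        CustomerIndexSketch index = new CustomerIndexSketch();
        // Made-up sample customers.
        index.addCustomer("1001", "alice", "alice@example.com", "555-0100");
        index.addCustomer("1002", "bob", "bob@example.com", "555-0101");
        System.out.println(index.search("email", "bob@example.com"));
        System.out.println(index.search("username", "alice"));
    }
}
```

Keeping the fields separate, rather than concatenating them into one string, is what allows a query to be restricted to email, phone or username.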

Lucene 4.0 in text search

I'm using Lucene 4.0 with Java. I'm trying to search for a string inside a string. In the Lucene hello world example, I wish to find the text "lucene" inside the phrase "inLuceneAction". I want it to find two matches in this case instead of one.
Any idea how to do it?
Thanks
public class HelloLucene {
public static void main(String[] args) throws IOException, ParseException {
// 0. Specify the analyzer for tokenizing text.
// The same analyzer should be used for indexing and searching
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
// 1. create the index
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
IndexWriter w = new IndexWriter(index, config);
addDoc(w, "inLuceneAction", "193398817");
addDoc(w, "Lucene for Dummies", "55320055Z");
addDoc(w, "Managing Gigabytes", "55063554A");
addDoc(w, "The Art of Computer Science", "9900333X");
w.close();
// 2. query
String querystr = args.length > 0 ? args[0] : "lucene";
// the "title" arg specifies the default field to use
// when no field is explicitly specified in the query.
Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr);
// 3. search
int hitsPerPage = 10;
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// 4. display results
System.out.println("Found " + hits.length + " hits.");
for(int i=0;i<hits.length;++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println((i + 1) + ". " + d.get("isbn") + "\t" + d.get("title"));
}
// reader can only be closed when there
// is no need to access the documents any more.
reader.close();
}
private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
Document doc = new Document();
doc.add(new TextField("title", title, Field.Store.YES));
// use a string field for isbn because we don't want it tokenized
doc.add(new StringField("isbn", isbn, Field.Store.YES));
w.addDocument(doc);
}
}
If you index the terms in the default way, so that inLuceneAction is a single term, Lucene won't be able to seek to that term when given lucene, because the two have different prefixes. Analyze the string so that it yields three indexed terms (in, Lucene, Action) and the document will be found. You may find a ready-made analyzer for this; otherwise you'll have to write your own. Writing your own analyzer is a bit out of scope for a single StackOverflow answer, but an excellent place to start is the package description at the bottom of the org.apache.lucene.analysis package Javadoc page.
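The splitting step itself can be tried in plain Java before wiring it into an analyzer. This is a sketch of the tokenization logic only, assuming camelCase boundaries are what should separate terms; in Lucene it would belong in a custom Tokenizer or TokenFilter, and the analyzers-common module has a WordDelimiterFilter that can split on case changes (check the Javadoc for your version). CamelCaseSplitter is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch of the term-splitting logic only; not a Lucene Analyzer.
final class CamelCaseSplitter {

    // Splits at every position where a lower-case letter or digit is followed
    // by an upper-case letter, then lower-cases each part.
    static List<String> split(String token) {
        List<String> terms = new ArrayList<>();
        for (String part : token.split("(?<=[\\p{Ll}\\p{Nd}])(?=\\p{Lu})")) {
            if (!part.isEmpty()) {
                terms.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(split("inLuceneAction")); // prints [in, lucene, action]
    }
}
```

With the title indexed as these three terms, the query lucene would match both "inLuceneAction" and "Lucene for Dummies".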

lucene get matched terms in query

What is the best way to find out which terms in a query matched a given document returned as a hit in Lucene?
I have tried a weird method involving the hit highlighting package in Lucene contrib, and also a method that searches for every word in the query against the topmost document ("docId: xy AND description: each_word_in_query").
Neither gives satisfactory results.
Hit highlighting does not report some of the words that matched for documents other than the first one.
I'm not sure if the second approach is the best alternative.
The explain method of the Searcher is a nice way to see which parts of a query matched and how each affects the overall score.
Example taken from the book Lucene in Action, 2nd Edition:
public class Explainer {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: Explainer <index dir> <query>");
System.exit(1);
}
String indexDir = args[0];
String queryExpression = args[1];
Directory directory = FSDirectory.open(new File(indexDir));
QueryParser parser = new QueryParser(Version.LUCENE_CURRENT,
"contents", new SimpleAnalyzer());
Query query = parser.parse(queryExpression);
System.out.println("Query: " + queryExpression);
IndexSearcher searcher = new IndexSearcher(directory);
TopDocs topDocs = searcher.search(query, 10);
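// Iterate over the returned hits only: totalHits counts every match and
// can exceed scoreDocs.length, which is capped at 10 by the search call.
for (ScoreDoc match : topDocs.scoreDocs) {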
Explanation explanation = searcher.explain(query, match.doc);
System.out.println("----------");
Document doc = searcher.doc(match.doc);
System.out.println(doc.get("title"));
System.out.println(explanation.toString());
}
}
}
This will explain the score of each document that matches the query.
I have not tried it yet, but have a look at the implementation of org.apache.lucene.search.highlight.QueryTermExtractor.
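Once the query's terms are available (for example from the class mentioned above) together with the tokens of a stored field, the matched terms are simply the intersection of the two sets. The sketch below shows that step in plain Java under a strong assumption: tokenize is a crude stand-in for the real analyzer (lower-casing and splitting on non-alphanumerics), so results may differ from what Lucene actually indexed. MatchedTerms and its methods are hypothetical names.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Sketch of the intersection step only; the tokenizer is NOT Lucene's analyzer.
final class MatchedTerms {

    // Crude stand-in for an analyzer: lower-case, split on non-alphanumerics.
    static Set<String> tokenize(String text) {
        Set<String> tokens = new LinkedHashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Returns the query terms that also occur in the document text,
    // in the order they appear in the query.
    static Set<String> matched(String queryText, String docText) {
        Set<String> result = tokenize(queryText);
        result.retainAll(tokenize(docText));
        return result;
    }

    public static void main(String[] args) {
        String doc = "in the end, progress is accomplished by the man who does things.";
        System.out.println(matched("progress or achievements", doc));
    }
}
```

This only reports term overlap; it does not account for phrase, wildcard or boolean semantics, which is where the explain output remains the more reliable source.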
