I'm new to text search and I'm studying some Lucene examples. I found one at this link: http://javatechniques.com/blog/lucene-in-memory-text-search-example/ I tried it in my Eclipse IDE, but it gives some errors, even though I imported all the relevant jar files.
Here is the code:
public class InMemoryExample {
public static void main(String[] args) {
// Construct a RAMDirectory to hold the in-memory representation
// of the index.
RAMDirectory idx = new RAMDirectory();
try {
// Make a writer to create the index
IndexWriter writer =
new IndexWriter(idx, new StandardAnalyzer(Version.LUCENE_48),
IndexWriter.MaxFieldLength.LIMITED);
// Add some Document objects containing quotes
writer.addDocument(createDocument("Theodore Roosevelt",
"It behooves every man to remember that the work of the " +
"critic, is of altogether secondary importance, and that, " +
"in the end, progress is accomplished by the man who does " +
"things."));
writer.addDocument(createDocument("Friedrich Hayek",
"The case for individual freedom rests largely on the " +
"recognition of the inevitable and universal ignorance " +
"of all of us concerning a great many of the factors on " +
"which the achievements of our ends and welfare depend."));
writer.addDocument(createDocument("Ayn Rand",
"There is nothing to take a man's freedom away from " +
"him, save other men. To be free, a man must be free " +
"of his brothers."));
writer.addDocument(createDocument("Mohandas Gandhi",
"Freedom is not worth having if it does not connote " +
"freedom to err."));
// Optimize and close the writer to finish building the index
writer.optimize();
writer.close();
// Build an IndexSearcher using the in-memory index
Searcher searcher = new IndexSearcher(idx);
// Run some queries
search(searcher, "freedom");
search(searcher, "free");
search(searcher, "progress or achievements");
searcher.close();
}
catch (IOException ioe) {
// In this example we aren't really doing an I/O, so this
// exception should never actually be thrown.
ioe.printStackTrace();
}
catch (ParseException pe) {
pe.printStackTrace();
}
}
/**
* Make a Document object with an un-indexed title field and an
* indexed content field.
*/
private static Document createDocument(String title, String content) {
Document doc = new Document();
// Add the title as an unindexed field...
doc.add(new Field("title", title, Field.Store.YES, Field.Index.NO));
// ...and the content as an indexed field. Note that indexed
// Text fields are constructed using a Reader. Lucene can read
// and index very large chunks of text, without storing the
// entire content verbatim in the index. In this example we
// can just wrap the content string in a StringReader.
doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED));
return doc;
}
/**
* Searches for the given string in the "content" field
*/
private static void search(Searcher searcher, String queryString)
throws ParseException, IOException {
// Build a Query object
//Query query = QueryParser.parse(
QueryParser parser = new QueryParser("content", new StandardAnalyzer(Version.LUCENE_48));
Query query = parser.parse(queryString);
int hitsPerPage = 10;
// Search for the query
TopScoreDocCollector collector = TopScoreDocCollector.create(5 * hitsPerPage, false);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
int hitCount = collector.getTotalHits();
System.out.println(hitCount + " total matching documents");
// Examine the Hits object to see if there were any matches
if (hitCount == 0) {
System.out.println(
"No matches were found for \"" + queryString + "\"");
} else {
System.out.println("Hits for \"" +
queryString + "\" were found in quotes by:");
// Iterate over the Documents in the Hits object
for (int i = 0; i < hitCount; i++) {
// Document doc = hits.doc(i);
ScoreDoc scoreDoc = hits[i];
int docId = scoreDoc.doc;
float docScore = scoreDoc.score;
System.out.println("docId: " + docId + "\t" + "docScore: " + docScore);
Document doc = searcher.doc(docId);
// Print the value that we stored in the "title" field. Note
// that this Field was not indexed, but (unlike the
// "contents" field) was stored verbatim and can be
// retrieved.
System.out.println(" " + (i + 1) + ". " + doc.get("title"));
System.out.println("Content: " + doc.get("content"));
}
}
System.out.println();
} }
but it shows a few syntax errors in the following lines:
Error 1: MaxFieldLength is underlined in red
IndexWriter writer =
new IndexWriter(idx, new StandardAnalyzer(Version.LUCENE_48),
IndexWriter.MaxFieldLength.LIMITED);
Error 2: optimize() is underlined in red
writer.optimize();
Error 3: new IndexSearcher(idx) is underlined in red
Searcher searcher = new IndexSearcher(idx);
Error 4: search is underlined in red
searcher.search(query, collector);
Could you please help me to get rid of these errors? It will be a great help. Thanks
Modified code:
public class InMemoryExample {
public static void main(String[] args) throws Exception{
// Construct a RAMDirectory to hold the in-memory representation
// of the index.
RAMDirectory idx = new RAMDirectory();
// Make a writer to create the index
IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_48, new
StandardAnalyzer(Version.LUCENE_48));
IndexWriter writer = new IndexWriter(idx, cfg);
// Add some Document objects containing quotes
writer.addDocument(createDocument("Theodore Roosevelt",
"It behooves every man to remember that the work of the " +
"critic, is of altogether secondary importance, and that, " +
"in the end, progress is accomplished by the man who does " +
"things."));
writer.addDocument(createDocument("Friedrich Hayek",
"The case for individual freedom rests largely on the " +
"recognition of the inevitable and universal ignorance " +
"of all of us concerning a great many of the factors on " +
"which the achievements of our ends and welfare depend."));
writer.addDocument(createDocument("Ayn Rand",
"There is nothing to take a man's freedom away from " +
"him, save other men. To be free, a man must be free " +
"of his brothers."));
writer.addDocument(createDocument("Mohandas Gandhi",
"Freedom is not worth having if it does not connote " +
"freedom to err."));
// Commit and close the writer to finish building the index
writer.commit();
writer.close();
// Build an IndexSearcher using the in-memory index
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(idx));
// Run some queries
search(searcher, "freedom");
search(searcher, "free");
search(searcher, "progress or achievements");
//searcher.close();
}
/**
* Make a Document object with an un-indexed title field and an
* indexed content field.
*/
private static Document createDocument(String title, String content) {
Document doc = new Document();
// Add the title as an unindexed field...
doc.add(new Field("title", title, Field.Store.YES, Field.Index.NO));
// ...and the content as an indexed (analyzed) field. The value is
// passed as a plain String here, and because of Field.Store.YES it
// is also stored verbatim so it can be printed with each hit.
doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED));
return doc;
}
/**
* Searches for the given string in the "content" field
*/
private static void search(IndexSearcher searcher, String queryString)
throws ParseException, IOException {
// Build a Query object
//Query query = QueryParser.parse(
QueryParser parser = new QueryParser("content", new StandardAnalyzer(Version.LUCENE_48));
Query query = parser.parse(queryString);
int hitsPerPage = 10;
// Search for the query
TopScoreDocCollector collector = TopScoreDocCollector.create(5 * hitsPerPage, false);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
int hitCount = collector.getTotalHits();
System.out.println(hitCount + " total matching documents");
// Examine the Hits object to see if there were any matches
if (hitCount == 0) {
System.out.println(
"No matches were found for \"" + queryString + "\"");
} else {
System.out.println("Hits for \"" +
queryString + "\" were found in quotes by:");
// Iterate over the returned ScoreDocs. The collector keeps at most
// 5 * hitsPerPage of them, so bound the loop by hits.length rather
// than by the total hit count.
for (int i = 0; i < hits.length; i++) {
ScoreDoc scoreDoc = hits[i];
int docId = scoreDoc.doc;
float docScore = scoreDoc.score;
System.out.println("docId: " + docId + "\t" + "docScore: " + docScore);
Document doc = searcher.doc(docId);
// Print the value that we stored in the "title" field. Note
// that this Field was not indexed, but it was stored verbatim
// and can therefore be retrieved.
System.out.println(" " + (i + 1) + ". " + doc.get("title"));
System.out.println("Content: " + doc.get("content"));
}
}
System.out.println();
} }
and this is the output:
Exception in thread "main" java.lang.VerifyError: class org.apache.lucene.analysis.SimpleAnalyzer overrides final method tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(Unknown Source)
    at java.security.SecureClassLoader.defineClass(Unknown Source)
    at java.net.URLClassLoader.defineClass(Unknown Source)
    at java.net.URLClassLoader.access$100(Unknown Source)
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.net.URLClassLoader$1.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at beehex.inmemeory.textsearch.InMemoryExample.search(InMemoryExample.java:98)
    at beehex.inmemeory.textsearch.InMemoryExample.main(InMemoryExample.java:58)
I don't see a third argument on the IndexWriter constructor. You should modify the code to fit the new Lucene API, like so:
IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_48, new StandardAnalyzer(Version.LUCENE_48));
IndexWriter writer = new IndexWriter(idx, cfg);
Also, rather than catching the exception here, I'd make the main method throw Exception and let the program fail altogether.
EDIT:
2) Remove the optimize() call, as the IndexWriter class no longer has that method (I think commit() will do the trick here).
3) Create the IndexSearcher like so, and declare the variable and the search() parameter as IndexSearcher, since the Searcher class no longer exists in Lucene 4:
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(idx));
Related
I have managed to create an index using some supplied Java code with Lucene and indexed 9 XML files successfully. I now have to modify another supplied Java file to search the index. I am able to output the number of hits, but I need to further modify the output so that when you submit a query it prints the top ten results in the following format:
Ranking: 0. Filename: .xml FilePath: c:/folder/movie.xml Score: 0.5
I'm trying to create a for loop, but none of the examples I have tried seem to work. This is my first venture with both Java and Lucene, so any help would be greatly appreciated.
public class LuceneSearch {
public int n = 0;
//String fileName;
/*searchIndex is the method involved with initiating Searching the Index
via the standardanalyzer by iterating through and using Hits for the results*/
public static void searchIndex(String searchString) throws IOException, ParseException {
String fieldContents = "summary";//current field name to search for. Each text item field name= 'contents'
String fileName = "filename";
String filePath = "filepath";
Directory directory = FSDirectory.getDirectory("/Users/Jac/Documents/index/");
//get index location
//initiate reader and searcher classes
IndexReader indexReader = IndexReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
//initiate standardanalyzer
Analyzer analyzer = new StandardAnalyzer();
//parse the query contents field with queryparser
QueryParser queryParser = new QueryParser(fieldContents, analyzer);
//get user query string
Query query = queryParser.parse(searchString.toLowerCase());
//Initiate HITS class and utilise methods
TopDocs hits = indexSearcher.search(query,null,10);
System.out.println("Searching for '" + searchString.toLowerCase() + "'");
//Hits hits = indexSearcher.search(query);
System.out.println("Number of hits: " + hits.totalHits);
System.out.println("Searching XML Tag Element '" + searchString.toLowerCase() + "'");
System.out.println("Number of hits: " + hits.totalHits);
for(ScoreDoc scoreDoc : hits.scoreDocs) {
// Document doc = IndexSearcher.doc(scoreDoc.doc);
System.out.println("Ranking: ");
// System.out.println(doc.get("fullpath"));
}
System.out.println("***Search Complete***");
}
public static void main(String[] args) throws Exception {
I am new to Lucene and am using Lucene 4. I am trying to create an index for a huge RDBMS table and search the Lucene index instead of querying the table directly. I gathered bits and pieces from different sites and tried them out, and indexing "seems" to be working OK. The following files are created in the index directory: _uu.fdt, _uu.fdx, _uu.fnm, _uu.si, segments.gen, segments_rs.
I tried to retrieve a record from the stored index, but it did not work: the search finds nothing and the hit count is zero.
Code snippet for creating index:
ResultSet rs = stmt.executeQuery("SELECT product_id, product_name, brand_id, brand_name, price, screen_type, size_category, usage_category FROM mobile_product_master WHERE product_id like 'No0%'");
Directory storeIndexDirectory = FSDirectory.open(new File("E:\\index_dir"));
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
while(rs.next())
{
productId = rs.getString("product_id");
productName = rs.getString("product_name");
brandId = rs.getString("brand_id");
brandName = rs.getString("brand_name");
price = rs.getString("price");
screenType = rs.getString("screen_type");
sizeCategory = rs.getString("size_category");
usageCategory = rs.getString("usage_category");
//doc = new Document(new Field());
doc = new Document();
doc.add(new Field("product_id",productId,Store.YES,Index.NO));
doc.add(new Field("product_name",productName,Store.YES,Index.NO));
doc.add(new Field("brand_id",brandId,Store.YES,Index.NO));
doc.add(new Field("brand_name",brandName,Store.YES,Index.NO));
doc.add(new Field("price",price,Store.YES,Index.NO));
doc.add(new Field("screen_type",screenType,Store.YES,Index.NO));
doc.add(new Field("size_category",sizeCategory,Store.YES,Index.NO));
doc.add(new Field("usage_category",usageCategory,Store.YES,Index.NO));
indexWriter = new IndexWriter(storeIndexDirectory, indexWriterConfig);
indexWriter.addDocument(doc);
indexWriter.close();
doc = null;
}
Code snippet for search:
String queryString = arg[0];
Directory storeIndexDirectory = FSDirectory.open(new File("E:\\index_dir"));
IndexReader indexReader = IndexReader.open(storeIndexDirectory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
QueryParser parser = new QueryParser(Version.LUCENE_40,"product_id",new StandardAnalyzer(Version.LUCENE_40));
Query query = parser.parse(queryString);
TopDocs topDocs = indexSearcher.search(query,1000);
ScoreDoc[] hits = topDocs.scoreDocs;
System.out.println(hits.length);
for(int i=0;i < hits.length; i++)
{
int docId = hits[i].doc;
Document d = indexSearcher.doc(docId);
System.out.println(d.get("product_id") + "," + d.get("product_name") + "," + d.get("brand_id") + "," + d.get("brand_name") + "," + d.get("price") + "," + d.get("screen_type") + "," + d.get("size_category") + "," + d.get("usage_category"));
}
I am not able to locate the error in search or indexing part.
With Lucene, if you want a field to be searchable, you must create it with an Index option other than Index.NO, for example Index.ANALYZED (or Index.NOT_ANALYZED for exact, untokenized values).
In your example all new Field(...) statements have the Index.NO parameter.
Change it only for the fields you want to search.
You can also use TextField (or StringField for untokenized values) instead of the generic Field.
The issue is resolved now. As SRS pointed out, the fields must be indexed, so I used Index.ANALYZED instead of Index.NO when creating the fields added to the document.
This raises a new question for me: in Lucene, I have to mark a field as Index.ANALYZED to make it searchable. So in what case would someone want a field created with searching disabled? We use Lucene to store docs/fields in order to search them, so in which use case do we use Index.NO? Thanks.
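One way to picture the answer to that last question: storing and indexing are two independent switches. A stored-only field cannot be searched, but it comes back with every hit, which is handy for values you only want to display. The sketch below is plain Java, not the Lucene API; the class, the `image_url` field and the sample values are made up for illustration.

```java
import java.util.*;

// Toy model of "stored" versus "indexed"; this is NOT the Lucene API.
class StoredVsIndexedSketch {
    // "field:term" -> ids of documents containing the term (indexed fields only)
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // document id -> field values kept verbatim (stored fields only)
    private final List<Map<String, String>> storedFields = new ArrayList<>();

    public int newDocument() {
        storedFields.add(new HashMap<>());
        return storedFields.size() - 1;
    }

    public void addField(int docId, String name, String value, boolean stored, boolean indexed) {
        if (stored) {
            storedFields.get(docId).put(name, value);
        }
        if (indexed) {
            // Very rough "analysis": lowercase and split on non-word characters.
            for (String token : value.toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    postings.computeIfAbsent(name + ":" + token, k -> new TreeSet<>()).add(docId);
                }
            }
        }
    }

    public Set<Integer> search(String field, String word) {
        return postings.getOrDefault(field + ":" + word.toLowerCase(), Collections.emptySet());
    }

    public String stored(int docId, String field) {
        return storedFields.get(docId).get(field);
    }

    public static void main(String[] args) {
        StoredVsIndexedSketch idx = new StoredVsIndexedSketch();
        int doc = idx.newDocument();
        // Searchable and retrievable.
        idx.addField(doc, "product_name", "Galaxy Note", true, true);
        // Hypothetical display-only value: retrievable, never searched.
        idx.addField(doc, "image_url", "http://example.com/note.png", true, false);

        System.out.println(idx.search("product_name", "galaxy")); // [0]
        System.out.println(idx.search("image_url", "example"));   // []
        System.out.println(idx.stored(doc, "image_url"));         // http://example.com/note.png
    }
}
```

Skipping the index for display-only values keeps the term dictionary smaller, while the value is still available when a hit is printed.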
I have a web application which stores customers' usernames, emails and phone numbers.
I want customers to be able to search for other users by email, phone or username, as a start, just to understand the whole Lucene concept. Later on I will add functionality to search the items a user posts. I am following this example: www.lucenetutorial.com/lucene-in-5-minutes.html
public class HelloLucene {
public static void main(String[] args) throws IOException, ParseException {
// 0. Specify the analyzer for tokenizing text.
// The same analyzer should be used for indexing and searching
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
// 1. create the index
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
IndexWriter w = new IndexWriter(index, config);
addDoc(w, "Lucene in Action", "193398817");
addDoc(w, "Lucene for Dummies", "55320055Z");
addDoc(w, "Managing Gigabytes", "55063554A");
addDoc(w, "The Art of Computer Science", "9900333X");
w.close();
// 2. query
String querystr = args.length > 0 ? args[0] : "lucene";
// the "title" arg specifies the default field to use
// when no field is explicitly specified in the query.
Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr);
// 3. search
int hitsPerPage = 10;
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// 4. display results
System.out.println("Found " + hits.length + " hits.");
for(int i=0;i<hits.length;++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println((i + 1) + ". " + d.get("isbn") + "\t" + d.get("title"));
}
// reader can only be closed when there
// is no need to access the documents any more.
reader.close();
}
private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
Document doc = new Document();
doc.add(new TextField("title", title, Field.Store.YES));
// use a string field for isbn because we don't want it tokenized
doc.add(new StringField("isbn", isbn, Field.Store.YES));
w.addDocument(doc);
}
}
I want new customers to be added to the index automatically on registration. customerId is a timestamp. So should I add a new document for each field of the customer's details, or should I concatenate all fields into a string and add it as a single document? Please go easy on me, I am really new.
This is a good place to start learning about the Lucene indexing mechanism:
http://www.ibm.com/developerworks/library/wa-lucene/
In short, when Lucene indexes something, it first has to be converted into a Lucene Document. A Document is a set of fields, and each field is a set of terms. Terms are nothing but sequences of bytes.
The content to be indexed is passed to an analyzer, which produces these terms; they are the keywords that are matched during searching.
When we perform a search, the query is analyzed by the same analyzer and then matched against the terms.
So you don't have to create a document for each field; rather, you should create a single document for each user, with one field per detail.
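To make the "one document per user, one field per detail" idea concrete, here is a small plain-Java sketch. It is not the Lucene API, only a toy model of the same flow: each user becomes one document, each detail is a named field, and the same `analyze` step is applied both when indexing and when searching. The class name, the whitespace-and-lowercase analysis and the sample users are assumptions made for illustration.

```java
import java.util.*;

// Toy model of one document per user; this mimics the idea, not the Lucene API.
class OneDocumentPerUserSketch {
    // "field:term" -> ids of the user documents containing that term
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final List<Map<String, String>> users = new ArrayList<>();

    // The single analysis step shared by indexing and searching.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    // One document per user; every detail is a separate named field of it.
    public int addUser(String username, String email, String phone) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("username", username);
        doc.put("email", email);
        doc.put("phone", phone);
        users.add(doc);
        int id = users.size() - 1;
        for (Map.Entry<String, String> field : doc.entrySet()) {
            for (String term : analyze(field.getValue())) {
                postings.computeIfAbsent(field.getKey() + ":" + term, k -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    // The query goes through the same analyze() before it is matched.
    public Set<Integer> search(String field, String query) {
        Set<Integer> result = new TreeSet<>();
        for (String term : analyze(query)) {
            result.addAll(postings.getOrDefault(field + ":" + term, Collections.emptySet()));
        }
        return result;
    }

    public static void main(String[] args) {
        OneDocumentPerUserSketch index = new OneDocumentPerUserSketch();
        index.addUser("alice", "Alice@Example.com", "5550100");
        index.addUser("bob", "bob@example.com", "5550101");

        System.out.println(index.search("email", "ALICE@example.com")); // [0]
        System.out.println(index.search("phone", "5550101"));           // [1]
        System.out.println(index.search("username", "5550101"));        // []
    }
}
```

Keeping the details in separate fields rather than one concatenated string lets a search target email, phone or username individually, which is what the question asks for.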
I wonder how to get the position of a word in a document using Lucene.
I have already generated the index files, and I want to extract some information from the index, such as the indexed words, the position of each word in the document, etc.
I created a reader like this:
public void readIndex(Directory indexDir) throws IOException {
IndexReader ir = IndexReader.open(indexDir);
Fields fields = MultiFields.getFields(ir);
System.out.println("TOTAL DOCUMENTS : " + ir.numDocs());
for(String field : fields) {
Terms terms = fields.terms(field);
TermsEnum termsEnum = terms.iterator(null);
BytesRef text;
while((text = termsEnum.next()) != null) {
System.out.println("text = " + text.utf8ToString() + "\nfrequency = " + termsEnum.totalTermFreq());
}
}
}
I modified the writer to :
org.apache.lucene.document.Document doc = new org.apache.lucene.document.Document();
FieldType fieldType = new FieldType();
fieldType.setStoreTermVectors(true);
fieldType.setStoreTermVectorPositions(true);
fieldType.setIndexed(true);
doc.add(new Field("word", new BufferedReader(new InputStreamReader(fis, "UTF-8")), fieldType));
I checked whether the terms have positions by calling terms.hasPositions(), which returns true.
But I have no idea which function gives me the positions.
Before you try to retrieve the positional information, you've got to make sure that indexing happened with positional information enabled in the first place.
TermsEnum.docsAndPositions(Bits liveDocs, DocsAndPositionsEnum reuse): Get DocsAndPositionsEnum for the current term. Do not call this when the enum is unpositioned. This method will return null if positions were not indexed. On the returned enum, advance with nextDoc() and then call nextPosition() freq() times to read the positions of the term in that document.
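As a mental model of what such a positions enum walks over, here is a plain-Java sketch of a positional inverted index. It is not Lucene code: the class and method names are invented, and positions are simply 0-based token offsets after a crude lowercase-and-split step.

```java
import java.util.*;

// Toy positional inverted index; illustrates the data shape, NOT the Lucene API.
class PositionalIndexSketch {
    // term -> (document id -> token positions of the term in that document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    public void addDocument(int docId, String text) {
        int position = 0;
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            postings.computeIfAbsent(token, k -> new TreeMap<>())
                    .computeIfAbsent(docId, k -> new ArrayList<>())
                    .add(position);
            position++;
        }
    }

    // Positions of a term in one document, or an empty list if it does not occur.
    public List<Integer> positions(String term, int docId) {
        return postings.getOrDefault(term, Collections.emptyMap())
                       .getOrDefault(docId, Collections.emptyList());
    }

    public static void main(String[] args) {
        PositionalIndexSketch index = new PositionalIndexSketch();
        index.addDocument(0, "to be or not to be");
        index.addDocument(1, "be quick");

        System.out.println(index.positions("to", 0)); // [0, 4]
        System.out.println(index.positions("be", 0)); // [1, 5]
        System.out.println(index.positions("be", 1)); // [0]
        System.out.println(index.positions("or", 1)); // []
    }
}
```

The nesting mirrors the order of iteration in a real index: first the term, then each document containing it, then each position within that document.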
I'm using lucene 4.0 with java. I'm trying to search for a string inside a string. If we look at the lucene hello world example, I wish to find the text "lucene" inside the phrase "inLuceneAction". I want it to find me two matches in this case instead of one.
Any Idea on how to do it?
Thanks
public class HelloLucene {
public static void main(String[] args) throws IOException, ParseException {
// 0. Specify the analyzer for tokenizing text.
// The same analyzer should be used for indexing and searching
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
// 1. create the index
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
IndexWriter w = new IndexWriter(index, config);
addDoc(w, "inLuceneAction", "193398817");
addDoc(w, "Lucene for Dummies", "55320055Z");
addDoc(w, "Managing Gigabytes", "55063554A");
addDoc(w, "The Art of Computer Science", "9900333X");
w.close();
// 2. query
String querystr = args.length > 0 ? args[0] : "lucene";
// the "title" arg specifies the default field to use
// when no field is explicitly specified in the query.
Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr);
// 3. search
int hitsPerPage = 10;
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// 4. display results
System.out.println("Found " + hits.length + " hits.");
for(int i=0;i<hits.length;++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println((i + 1) + ". " + d.get("isbn") + "\t" + d.get("title"));
}
// reader can only be closed when there
// is no need to access the documents any more.
reader.close();
}
private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
Document doc = new Document();
doc.add(new TextField("title", title, Field.Store.YES));
// use a string field for isbn because we don't want it tokenized
doc.add(new StringField("isbn", isbn, Field.Store.YES));
w.addDocument(doc);
}
}
If you index the terms in the default way, inLuceneAction ends up as one term (StandardAnalyzer only lowercases it to inluceneaction), and Lucene won't be able to seek to this term given lucene, because it has a different prefix. Analyze this string so that it results in three indexed terms (in, lucene, action) and then it will be found. You'll either find a ready-made analyzer for this or you'll have to write your own. Writing your own analyzer is a bit out of scope for a single StackOverflow answer, but an excellent place to start is the package info at the bottom of the org.apache.lucene.analysis package Javadoc page.
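The splitting step itself is small. The sketch below is plain Java, not a Lucene Analyzer: it only shows one possible rule (break on a lowercase-to-uppercase change, then lowercase each piece) that turns inLuceneAction into three terms. The class name and the regular expression are assumptions for illustration. As far as I know, the lucene-analyzers-common module also ships a WordDelimiterFilter that can split on case changes, which may be the ready-made option worth checking first.

```java
import java.util.*;

// Illustrative tokenizer only; a real solution would wrap this logic in a Lucene analysis chain.
class CamelCaseTokenizerSketch {

    // Split on whitespace, then on every lowercase-to-uppercase boundary, then lowercase.
    public static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String word : text.split("\\s+")) {
            // Zero-width split: between a lowercase letter and a following uppercase letter.
            for (String part : word.split("(?<=[a-z])(?=[A-Z])")) {
                if (!part.isEmpty()) {
                    terms.add(part.toLowerCase(Locale.ROOT));
                }
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("inLuceneAction"));     // [in, lucene, action]
        System.out.println(tokenize("Lucene for Dummies")); // [lucene, for, dummies]
    }
}
```

With terms produced this way at index time, a query for lucene would match both titles, since each now contains the term lucene on its own.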