I am new to Lucene, using Lucene4. Trying to create index for a huge RDBMS table and do search from lucene index instead of table directly. Gathered bit and pieces from different sites, tried it out and indexing "seems" to be working ok. Following files are created in index directory: _uu.fdt, _uu.fdx, _uu.fnm, _uu.si, segments.gen, segments_rs.
Tried retrieve a record from stored index but it did not work. Hit is failing, hit count is returning zero.
Code snippet for creating index:
ResultSet rs = stmt.executeQuery("SELECT product_id, product_name, brand_id, brand_name, price, screen_type, size_category, usage_category FROM mobile_product_master WHERE product_id like 'No0%'");
Directory storeIndexDirectory = FSDirectory.open(new File("E:\\index_dir"));
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
while(rs.next())
{
productId = rs.getString("product_id");
productName = rs.getString("product_name");
brandId = rs.getString("brand_id");
brandName = rs.getString("brand_name");
price = rs.getString("price");
screenType = rs.getString("screen_type");
sizeCategory = rs.getString("size_category");
usageCategory = rs.getString("usage_category");
//doc = new Document(new Field());
doc = new Document();
doc.add(new Field("product_id",productId,Store.YES,Index.NO));
doc.add(new Field("product_name",productName,Store.YES,Index.NO));
doc.add(new Field("brand_id",brandId,Store.YES,Index.NO));
doc.add(new Field("brand_name",brandName,Store.YES,Index.NO));
doc.add(new Field("price",price,Store.YES,Index.NO));
doc.add(new Field("screen_type",screenType,Store.YES,Index.NO));
doc.add(new Field("size_category",sizeCategory,Store.YES,Index.NO));
doc.add(new Field("usage_category",usageCategory,Store.YES,Index.NO));
indexWriter = new IndexWriter(storeIndexDirectory, indexWriterConfig);
indexWriter.addDocument(doc);
indexWriter.close();
doc = null;
}
Code snippet for search:
String queryString = arg[0];
Directory storeIndexDirectory = FSDirectory.open(new File("E:\\index_dir"));
IndexReader indexReader = IndexReader.open(storeIndexDirectory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
QueryParser parser = new QueryParser(Version.LUCENE_40,"product_id",new StandardAnalyzer(Version.LUCENE_40));
Query query = parser.parse(queryString);
TopDocs topDocs = indexSearcher.search(query,1000);
ScoreDoc[] hits = topDocs.scoreDocs;
System.out.println(hits.length);
for(int i=0;i < hits.length; i++)
{
int docId = hits[i].doc;
Document d = indexSearcher.doc(docId);
System.out.println(d.get("product_id") + "," + d.get("product_name") + "," + d.get("brand_id") + "," + d.get("brand_name") + "," + d.get("price") + "," + d.get("screen_type") + "," + d.get("size_category") + "," + d.get("usage_category"));
}
I am not able to locate the error in search or indexing part.
With Lucene if you want that your field is "searchable" you must create a field with Index.YES.
In your example all new Field(...) statements have Index.NO parameter.
Change it to Index.YES only for a field you want to search.
You can also use TextField instead of generic Field with Index.YES.
Issue is resolved now. I used Index.ANALYZED while creating a field[adding to document] instead of using Index.NO. As SRS has pointed out, Index.YES would also work.
This raises a new question to me; In Lucene, I have to mark Index.YES/Index.ANALYZED to make the field searchable. So what is the case where someone would want a field to be created with searchable disabled? We use Lucene , store docs/fields to search it so in which use case do we use Index.No?!. Thanks.
Related
I am trying to do autocomplete using lucene search functionality. I have the following code which searches by the query prefix but along with that it also gives me all the sentences containing that word while I want it to display only sentence or word starting exactly with that prefix.
ex: m
--holiday mansion houseboat
--eye muscles
--movies of all time
--machine
I want it to show only last 2 queries. How to do it am stucked here also I am new to lucene. Please can any one help me in this. Thanks in advance.
addDoc(IndexWriter w, String title, String isbn) throws IOException {
Document doc = new Document();
doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
// use a string field for isbn because we don't want it tokenized
doc.add(new Field("isbn", isbn, Field.Store.YES, Field.Index.ANALYZED));
w.addDocument(doc);
}
Main:
try {
// 0. Specify the analyzer for tokenizing text.
// The same analyzer should be used for indexing and searching
StandardAnalyzer analyzer = new StandardAnalyzer();
// 1. create the index
Directory index = FSDirectory.open(new File(indexDir));
IndexWriter writer = new IndexWriter(index, new StandardAnalyzer(Version.LUCENE_30), true, IndexWriter.MaxFieldLength.UNLIMITED); //3
for (int i = 0; i < source.size(); i++) {
addDoc(writer, source.get(i), + (i + 1) + "z");
}
writer.close();
// 2. query
Term term = new Term("title", querystr);
//create the term query object
PrefixQuery query = new PrefixQuery(term);
// 3. search
int hitsPerPage = 20;
IndexReader reader = IndexReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// 4. Get results
for (int i = 0; i < hits.length; ++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println(d.get("title"));
}
reader.close();
} catch (Exception e) {
System.out.println("Exception (LuceneAlgo.getSimilarString()) : " + e);
}
}
}
I see two solutions:
as suggested by Yahnoosh, save the title field twice, Once as TextField (=analyzed) and once as StringField (not analyzed)
save it just as TextField, but When Querying use SpanFirstQuery
// 2. query
Term term = new Term("title", querystr);
//create the term query object
PrefixQuery pq = new PrefixQuery(term);
SpanQuery wrapper = new SpanMultiTermQueryWrapper<PrefixQuery>(pq);
Query final = new SpanFirstQuery(wrapper, 1);
If I understand your scenario correctly, you want to autocomplete on the title field.
The solution is to have two fields: one analyzed, to enable querying over it, one non-analyzed to have titles indexed without breaking them into individual terms.
Your autocomplete logic should issue prefix queries against the non-analyzed field to match only on the first word. Your term queries should be issued against the analyzed field for matches within the title.
I hope that makes sense.
I want to delete a document in apache lucene having exact match only. for example I have documents containing text:
Document1: Bilal
Document2: Bilal Ahmed
Doucument3: Bilal Ahmed - 54
And when Try to remove the document with query 'Bilal' it deletes all these three documents while it should delete just first document with exact match.
The Code I use is this:
String query = "bilal";
String field = "userNames";
Term term = new Term(field, query);
IndexWriter indexWriter = null;
File indexDir = new File(idexedDirectory);
Directory directory = FSDirectory.open(indexDir);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46, analyzer);
indexWriter = new IndexWriter(directory, iwc);
indexWriter.deleteDocuments(term);
indexWriter.close();
This is how I am indexing my documents:
File indexDir = new File("C:\\Local DB\\TextFiled");
Directory directory = FSDirectory.open(indexDir);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46, analyzer);
//Thirdly We tell the Index Writer that which document to index
indexWriter = new IndexWriter(directory, iwc);
int i = 0;
try (DataSource db = DataSource.getInstance()) {
PreparedStatement ps = db.getPreparedStatement(
"SELECT user_id, username FROM " + TABLE_NAME + " as au" + User_CONDITION);
try (ResultSet resultSet = ps.executeQuery()) {
while (resultSet.next()) {
i++;
doc = new Document();
text = resultSet.getString("username");
doc.add(new StringField("userNames", text, Field.Store.YES));
indexWriter.addDocument(doc);
System.out.println("User Name : " + text + " : " + userID);
}
}
You have missed to provide how you index those documents. If they are indexed using StandardAnalyzer and tokenization is on, it is understandable that you get these results - this is because StandardAnalyzer tokenizes the text for each word and since each of your documents contains Bilal, you hit all those documents as a result.
The general advice is that you should always add a unique id field and query/delete by this id field.
If you can't do this - index the same text as a separate field - without tokenization - and use phrase query to find the exact match, but this sounds like a horrible hack to me.
Hi I am java developer and learning Lucene. I have a java class that index a pdf(lucene_in_action_2nd_edition.pdf) file and a search class that search some text from index. IndexSearcher is giving Document which shows that string exists in index(lucene_in_action_2nd_edition.pdf) or not.
But now I want to get searched data or metadata. i.e. I want to know that at which page string is matched, or few text around matched string, etc... How to do that?
Here is my LuceneSearcher.java class:
public static void main(String[] args) throws Exception {
File indexDir = new File("D:\\index");
String querystr = "Advantages of FastVectorHighlighter";
Query q = new QueryParser(Version.LUCENE_40, "contents",
new StandardAnalyzer(Version.LUCENE_40)).parse(querystr);
int hitsPerPage = 100;
IndexReader reader = DirectoryReader.open(FSDirectory.open(indexDir));
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(
hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
System.out.println("Found " + hits.length + " hits.");
for (int i = 0; i < hits.length; i++) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println((i + 1) + "... " + d.get("filename"));
System.out.println("=====================================================");
System.out.println(d.get("contents"));
}
// reader can only be closed when there
// is no need to access the documents any more.
reader.close();
}
Here d.get("contents") give full text(generated by Tika) of .pdf file, that was stored at time of indexing.
I want some information about searched text, so that I can show that on my web page or highlight searched text properly(like google search output). How to achieve that? Do we need to write some logic or Lucene does it internally?
Any type of help would be appreciated. Thanks in advance.
The org.apache.lucene.search.highlight package provides this functionality.
Such as:
SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter();
Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(query));
for (int i = 0; i < hits.length; i++) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
String text = doc.get("contents");
String bestFrag = highlighter.getBestFragment(analyzer, "contents", text);
//output, however you like.
You can also get a list of best Fragments from the highlighter, instead of just a single one, if you prefer, see the Highlighter API
I have a web application which stores customers usernames, emails and phone numbers.
I want customers to search for other users using email, phone or username for a start just to understand the whole lucene concept. then later on i will add functionality to search within a user an item he posts. I am following this example on www.lucenetutorial.com/lucene-in-5-minutes.html
public class HelloLucene {
public static void main(String[] args) throws IOException, ParseException {
// 0. Specify the analyzer for tokenizing text.
// The same analyzer should be used for indexing and searching
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
// 1. create the index
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
IndexWriter w = new IndexWriter(index, config);
addDoc(w, "Lucene in Action", "193398817");
addDoc(w, "Lucene for Dummies", "55320055Z");
addDoc(w, "Managing Gigabytes", "55063554A");
addDoc(w, "The Art of Computer Science", "9900333X");
w.close();
// 2. query
String querystr = args.length > 0 ? args[0] : "lucene";
// the "title" arg specifies the default field to use
// when no field is explicitly specified in the query.
Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr);
// 3. search
int hitsPerPage = 10;
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// 4. display results
System.out.println("Found " + hits.length + " hits.");
for(int i=0;i<hits.length;++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println((i + 1) + ". " + d.get("isbn") + "\t" + d.get("title"));
}
// reader can only be closed when there
// is no need to access the documents any more.
reader.close();
}
private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
Document doc = new Document();
doc.add(new TextField("title", title, Field.Store.YES));
// use a string field for isbn because we don't want it tokenized
doc.add(new StringField("isbn", isbn, Field.Store.YES));
w.addDocument(doc);
}
}
I want new customers to be added to index automatically on registration. customerId is timestamp. so should i add a new document for each field on the customers details or should i concatenate all fields into a string and add as a single document? Please go easy on me I am really new.
This is a good place to start with Lucene indexing mechanism
http://www.ibm.com/developerworks/library/wa-lucene/
In the bottom line when lucene index the document, it first converts it into lucene document form. This lucene document comprises of set of fields and each field is a set of terms. Term are nothing but stream of bytes.
The document which is to be index to pass to analyzer which forms these terms out of it, and these terms keywords which are match during searching process.
When we perform a search process the query is analyzed through the same analyzer and then is match against the terms.
So you dont have to create a document for each field, rather you should create a single document for each user.
I'm using lucene 4.0 with java. I'm trying to search for a string inside a string. If we look at the lucene hello world example, I wish to find the text "lucene" inside the phrase "inLuceneAction". I want it to find me two matches in this case instead of one.
Any Idea on how to do it?
Thanks
public class HelloLucene {
public static void main(String[] args) throws IOException, ParseException {
// 0. Specify the analyzer for tokenizing text.
// The same analyzer should be used for indexing and searching
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
// 1. create the index
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
IndexWriter w = new IndexWriter(index, config);
addDoc(w, "inLuceneAction", "193398817");
addDoc(w, "Lucene for Dummies", "55320055Z");
addDoc(w, "Managing Gigabytes", "55063554A");
addDoc(w, "The Art of Computer Science", "9900333X");
w.close();
// 2. query
String querystr = args.length > 0 ? args[0] : "lucene";
// the "title" arg specifies the default field to use
// when no field is explicitly specified in the query.
Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr);
// 3. search
int hitsPerPage = 10;
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// 4. display results
System.out.println("Found " + hits.length + " hits.");
for(int i=0;i<hits.length;++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println((i + 1) + ". " + d.get("isbn") + "\t" + d.get("title"));
}
// reader can only be closed when there
// is no need to access the documents any more.
reader.close();
}
private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
Document doc = new Document();
doc.add(new TextField("title", title, Field.Store.YES));
// use a string field for isbn because we don't want it tokenized
doc.add(new StringField("isbn", isbn, Field.Store.YES));
w.addDocument(doc);
}
}
If you index the terms in the default way, meaning inLuceneAction is one term, Lucene won't be able to seek to this term given Lucene because it has a different prefix. Analyze this string so that it results in three indexed terms: in Lucene Action and then you'll have it fetched. You'll either find a ready-made analyzer for this or you'll have to write your own. Writing own analyzers is a bit out of scope for a single StackOverflow answer, but an excellent place to start is the package info at the bottom of the org.apache.lucene.analysis package Javadoc page.