Apache Lucene - Creating and Storing an Index (Java)

This post is a follow-on from my previous question:
Apache Lucene - Optimizing Searching
I want to create an index from title stored in my database, store the index on the server from which I am running my web application, and have that index available to all users who are using the search feature on the web application.
I will update the index when a new title is added, edited or deleted.
I cannot find a tutorial for this in Apache Lucene, so can anyone help me code it in Java (using Spring)?

From my understanding of your question, you need to do the following:
1) Index your data (titles in your case)
First, you need to implement the code that creates the index for your data; check this code sample:
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
// Store the index in memory:
//Directory directory = new RAMDirectory();
// Store the index on disk:
Directory directory = FSDirectory.open(indexfilesDirPathOnYourServer);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
Document doc = new Document();
String title = getTitle();
doc.add(new Field("fieldname", title, TextField.TYPE_STORED));
iwriter.addDocument(doc);
iwriter.close();
Here you need to loop over all your data, for example:
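A minimal sketch of that loop (replacing the single-document snippet above), assuming a hypothetical getAllTitles() helper that loads every title from your database:
// hypothetical helper: getAllTitles() returns all titles from your database
for (String title : getAllTitles()) {
    Document doc = new Document();
    doc.add(new Field("fieldname", title, TextField.TYPE_STORED));
    iwriter.addDocument(doc); // one Lucene document per title
}
iwriter.close(); // close once, after the loop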
2) Search your indexed data.
You can search your data using this code:
DirectoryReader ireader = DirectoryReader.open(directory); // open the same directory used for indexing
IndexSearcher isearcher = new IndexSearcher(ireader);
// Parse a simple query that searches for "test":
QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "fieldname", analyzer); // note: the same analyzer object used at index time
Query query = parser.parse("test"); // "test" is an example search query
ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
// Iterate through the results:
for (int i = 0; i < hits.length; i++) {
Document hitDoc = isearcher.doc(hits[i].doc);
System.out.println(hitDoc.get("fieldname"));
}
ireader.close();
directory.close();
Note: you don't have to fetch all the data from your DB here; you can get it directly from the index. You also don't have to re-create the whole index each time a user searches or the data changes; you can update the index incrementally whenever a title is added, updated or deleted (only the affected title, not all indexed titles).
To update a document in the index, use:
Term keyTerm = new Term(KEY_FIELD, KEY_VALUE);
iwriter.updateDocument(keyTerm, updatedFields);
To delete a document from the index, use:
Term keyTerm = new Term(KEY_FIELD, KEY_VALUE);
iwriter.deleteDocuments(keyTerm);
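Putting this together, a minimal sketch (my own, assuming each title document also stores its database id in a hypothetical "dbId" key field) of a method you could call whenever a title is added or edited:
// sketch: "dbId" is a hypothetical untokenized key field holding the title's database id
public void indexTitle(long dbId, String title) throws IOException {
    Term key = new Term("dbId", String.valueOf(dbId));
    Document doc = new Document();
    doc.add(new StringField("dbId", String.valueOf(dbId), Field.Store.YES)); // single-token key
    doc.add(new Field("fieldname", title, TextField.TYPE_STORED));           // searchable title
    iwriter.updateDocument(key, doc); // inserts if the key is new, replaces the old document otherwise
    iwriter.commit();
}
For a deleted title you would call iwriter.deleteDocuments(new Term("dbId", String.valueOf(dbId))) and commit in the same way.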
Hope that helps.

Related

Lucene Index Query does not find document if too many documents/similar documents present

If I create documents as such:
{
Document document = new Document();
document.add(new TextField("id", "10384-10735", Field.Store.YES));
submitDocument(document);
}
{
Document document = new Document();
document.add(new TextField("id", "10735", Field.Store.YES));
submitDocument(document);
}
for (int i = 20000; i < 80000; i += 123) {
Document otherDoc1 = new Document();
otherDoc1.add(new TextField("id", String.valueOf(i), Field.Store.YES));
submitDocument(otherDoc1);
Document otherDoc2 = new Document();
otherDoc2.add(new TextField("id", i + "-" + (i + 67), Field.Store.YES));
submitDocument(otherDoc2);
}
meaning:
one with an id of 10384-10735
one with an id of 10735 (which is the last part of the previous document ID)
and 975 other documents with pretty much any ID
and then write them using:
final IndexWriterConfig luceneWriterConfig = new IndexWriterConfig(new StandardAnalyzer());
luceneWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
final IndexWriter luceneDocumentWriter = new IndexWriter(luceneDirectory, luceneWriterConfig);
for (Map.Entry<String, Document> indexDocument : indexDocuments.entrySet()) {
final Term term = new Term(Index.UNIQUE_LUCENE_DOCUMENT_ID, indexDocument.getKey());
indexDocument.getValue().add(new TextField(Index.UNIQUE_LUCENE_DOCUMENT_ID, indexDocument.getKey(), Field.Store.YES));
luceneDocumentWriter.updateDocument(term, indexDocument.getValue());
}
luceneDocumentWriter.close();
Now that the index is written, I want to perform a query, searching for the document with the ID 10384-10735.
I will be doing this in two ways, using the TermQuery and a QueryParser with the StandardAnalyzer:
System.out.println("term query: " + index.findDocuments(new TermQuery(new Term("id", "10384-10735"))));
final QueryParser parser = new QueryParser(Index.UNIQUE_LUCENE_DOCUMENT_ID, new StandardAnalyzer());
System.out.println("query parser: " + index.findDocuments(parser.parse("id:\"10384 10735\"")));
In both cases, I would expect the document to appear. This is the result if I run the queries however:
term query: []
query parser: []
which seems odd. I experimented a bit further and found that if I either reduce the number of documents OR remove the entry 10735, the query parser query now successfully finds the document:
term query: []
query parser: [Document<stored,indexed,tokenized<id:10384-10735> stored,indexed,tokenized<uldid:10384-10735>>]
meaning this works:
{
Document document = new Document();
document.add(new TextField("id", "10384-10735", Field.Store.YES));
submitDocument(document);
}
for (int i = 20000; i < 80000; i += 123) {
Document otherDoc1 = new Document();
otherDoc1.add(new TextField("id", String.valueOf(i), Field.Store.YES));
submitDocument(otherDoc1);
Document otherDoc2 = new Document();
otherDoc2.add(new TextField("id", i + "-" + (i + 67), Field.Store.YES));
submitDocument(otherDoc2);
}
and this works (490 documents)
{
Document document = new Document();
document.add(new TextField("id", "10384-10735", Field.Store.YES));
submitDocument(document);
}
{
Document document = new Document();
document.add(new TextField("id", "10735", Field.Store.YES));
submitDocument(document);
}
for (int i = 20000; i < 50000; i += 123) {
Document otherDoc1 = new Document();
otherDoc1.add(new TextField("id", String.valueOf(i), Field.Store.YES));
submitDocument(otherDoc1);
Document otherDoc2 = new Document();
otherDoc2.add(new TextField("id", i + "-" + (i + 67), Field.Store.YES));
submitDocument(otherDoc2);
}
Does somebody know what causes this? I really need the index to consistently find the documents. I'm fine with using the QueryParser and not the TermQuery.
I use 9.3.0 lucene-core and lucene-queryparser.
Thank you for your help in advance.
Edit 1: This is the code in findDocuments():
final TopDocs topDocs = getIndexSearcher().search(query, Integer.MAX_VALUE);
final List<Document> documents = new ArrayList<>((int) topDocs.totalHits.value);
for (int i = 0; i < topDocs.totalHits.value; i++) {
documents.add(getIndexSearcher().doc(topDocs.scoreDocs[i].doc));
}
return documents;
Edit 2: here is a working example: https://pastebin.com/Ft0r8pN5
For some reason, the issue with too many documents does not happen in this one, which I will look into. I still left it in for the example. This is my output:
[similar id: true, many documents: true]
Indexing [3092] documents
term query: []
query parser: []
[similar id: true, many documents: false]
Indexing [654] documents
term query: []
query parser: []
[similar id: false, many documents: true]
Indexing [3091] documents
term query: []
query parser: [Document<stored,indexed,tokenized<id:10384-10735> stored,indexed,tokenized<uldid:10384-10735>>]
[similar id: false, many documents: false]
Indexing [653] documents
term query: []
query parser: [Document<stored,indexed,tokenized<id:10384-10735> stored,indexed,tokenized<uldid:10384-10735>>]
As you can see, if the document with the ID 10735 is added to the documents, the document cannot be found anymore.
Summary
The problem is caused by a combination of (a) the order in which your documents are processed; and (b) the fact that updateDocument first deletes and then inserts data in the index.
When you use writer.updateDocument(term, document), Lucene performs an atomic delete-then-add:
Updates a document by first deleting the document(s) containing term and then adding the new document.
In your case, the order in which documents are processed is based on how they are retrieved from your Java Map - and that is based on how the entries are hashed by the map.
As you note in your answer, you already have a way to avoid this by using your Java object hashes as the updateDocument terms. (As long as you don't get any hash collisions.)
This answer attempts to explain the "why" behind the results you are seeing.
Basic Demonstration
This is a highly simplified version of your code.
Consider the following two Lucene documents:
final Document documentA = new Document();
documentA.add(new TextField(FIELD_NAME, "10735", Field.Store.YES));
final Term termA = new Term(FIELD_NAME, "10735");
writer.updateDocument(termA, documentA);
final Document documentB = new Document();
documentB.add(new TextField(FIELD_NAME, "10384-10735", Field.Store.YES));
final Term termB = new Term(FIELD_NAME, "10384-10735");
writer.updateDocument(termB, documentB);
documentA then documentB:
Lucene has nothing to delete when documentA is added. After the doc is added, the index contains the following:
field id
term 10735
doc 0
freq 1
pos 0
So, we have only one token 10735.
For documentB, there are no documents in the index containing the term 10384-10735 - and therefore nothing is deleted prior to documentB being added to the index.
We end up with the following final indexed data:
field id
term 10384
doc 1
freq 1
pos 0
term 10735
doc 0
freq 1
pos 0
doc 1
freq 1
pos 1
When we search for 10384, we get one hit, as expected.
documentB then documentA:
If we swap the order in which the 2 documents are processed, we see the following after documentB is indexed:
field id
term 10384
doc 0
freq 1
pos 0
term 10735
doc 0
freq 1
pos 1
When documentA is indexed, Lucene finds that doc 0 (above) in the index does contain the term 10735 used by documentA. Therefore all of the doc 0 entries are deleted from the index, before documentA is added.
We end up with the following indexed data (basically, a new doc 0, after the original doc 0 was deleted):
field id
term 10735
doc 0
freq 1
pos 0
Now when we search for 10384, we get zero hits - not what we expected.
More Complicated Demonstration
Things are made more complicated in your scenario in the question by your use of a Java Map to collect the documents to be indexed. This causes the order in which your Lucene documents are indexed to be different from the order in which they are created, due to hashing performed by the map.
Here is another simplified version of your code, but this time it uses a map:
public class MyIndexBuilder {
private static final String INDEX_PATH = "index";
private static final String FIELD_NAME = "id";
private static final Map<String, Document> indexDocuments = new HashMap<>();
public static void buildIndex() throws IOException, FileNotFoundException, ParseException {
final Directory dir = FSDirectory.open(Paths.get(INDEX_PATH));
Analyzer analyzer = new StandardAnalyzer();
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setOpenMode(OpenMode.CREATE);
//iwc.setCodec(new SimpleTextCodec());
try ( IndexWriter writer = new IndexWriter(dir, iwc)) {
String suffix = "10429";
Document document1 = new Document();
document1.add(new TextField("id", "10001-" + suffix, Field.Store.YES));
indexDocuments.put("10001-" + suffix, document1);
Document document2 = new Document();
document2.add(new TextField("id", suffix, Field.Store.YES));
indexDocuments.put(suffix, document2);
//int max = 10193; // OK
int max = 10192; // not OK
for (int i = 10003; i <= max; i += 1) {
Document otherDoc1 = new Document();
otherDoc1.add(new TextField(FIELD_NAME, String.valueOf(i), Field.Store.YES));
indexDocuments.put(String.valueOf(i), otherDoc1);
}
System.out.println("Total docs: " + indexDocuments.size());
for (Map.Entry<String, Document> indexDocument : indexDocuments.entrySet()) {
if (indexDocument.getKey().contains(suffix)) {
// show the order in which the document1 and document2 are indexed:
System.out.println(indexDocument.getKey());
}
final Term term = new Term(FIELD_NAME, indexDocument.getKey());
writer.updateDocument(term, indexDocument.getValue());
}
}
}
}
In addition to the two documents we are interested in, I add 191 additional (completely unrelated) documents to the index.
When I process the map, I see the following output:
Total docs: 193
10429
10001-10429
So, document2 is indexed before document1 - and our search for 10001 finds one hit.
But if I process fewer of these "extra" documents (190 instead of 191):
int max = 10192; // not OK
...then I get this output:
Total docs: 192
10001-10429
10429
You can see that the order in which document1 and document2 are processed has been flipped - and now that same search for 10001 finds zero hits.
A seemingly unrelated change (processing one fewer document) has caused the retrieval order from the map to change, causing the indexed data to be different.
(I was incorrect in one of my comments in the question, when I noted that the indexed data was apparently identical. It is not the same. I missed that when I was first looking at the indexed data.)
Recommendation
Consider adding a new field to your Lucene documents, for storing each document's unique identifier.
You could call it doc_id and it would be created as a StringField, not as a TextField.
This would ensure that the contents of this field are never processed by the Standard Analyzer and are stored in the index as a single (presumably unique) token. A StringField is indexed but not tokenized.
You can then use this field when building your term to use in the updateDocument() method. And you can use the existing id field for searches.
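A minimal sketch of that recommendation for the Lucene 9.x API used in the question (the field name doc_id is just an illustration, not from the original code):
// hypothetical "doc_id" field: indexed as a single untokenized term and used only as the update key
Document document = new Document();
document.add(new TextField("id", "10384-10735", Field.Store.YES));        // searchable, analyzed
document.add(new StringField("doc_id", "10384-10735", Field.Store.YES));  // one token, never analyzed
// update by the untokenized key, so tokens of other documents can never collide with it
writer.updateDocument(new Term("doc_id", "10384-10735"), document);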
At first glance, a possible solution for this would be:
The updateDocument() method with a term passed as the first parameter is currently used to build the index. When either passing null as the term or using the addDocument() method, the query successfully returned the correct values. The solution must have something to do with the Term.
luceneDocumentWriter.addDocument(indexDocument.getFields());
// or
luceneDocumentWriter.updateDocument(null, indexDocument);
Playing around a bit further: the field name used in the term under which the document is stored cannot be reused as a field key inside the document, otherwise the document becomes unsearchable:
final Term term = new Term("uldid", indexDocument.get("id"));
// would work, different key from term...
indexDocument.add(new TextField("uldid2", indexDocument.get("id"), Field.Store.YES));
// would not work...
indexDocument.add(new TextField("uldid", indexDocument.get("id"), Field.Store.YES));
// ...when adding to index using term
luceneDocumentWriter.updateDocument(term, indexDocument);
Another way to circumvent this would be to use a different value from the identical field in the document (uldid in this case), that is also different from the ID that is being searched in the index:
final Term term = new Term("uldid", indexDocument.get("id").hashCode() + "");
// or
indexDocument.add(new TextField("uldid", indexDocument.get("id").hashCode() + "", Field.Store.YES));
Which seems rather odd. I don't really have a final solution or a reason why it behaves this way, but I will be using the second option from now on, using the hash of whatever key I want to store the document under as the Term.

PhraseQuery is not working in Apache Lucene 7.2.1

I am new to Apache Lucene. I am using Apache Lucene v7.2.1.
I need to do a phrase search in a huge file. I first wrote some sample code to figure out the phrase search functionality in Lucene using PhraseQuery, but it does not work.
My code is given below:
public class LuceneExample
{
private static final String INDEX_DIR = "myIndexDir";
// function to create index writer
private static IndexWriter createWriter() throws IOException
{
FSDirectory dir = FSDirectory.open(Paths.get(INDEX_DIR));
IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
IndexWriter writer = new IndexWriter(dir, config);
return writer;
}
// function to create the index document.
private static Document createDocument(Integer id, String source, String target)
{
Document document = new Document();
document.add(new StringField("id", id.toString() , Store.YES));
document.add(new TextField("source", source , Store.YES));
document.add(new TextField("target", target , Store.YES));
return document;
}
// function to do index search by source
private static TopDocs searchBySource(String source, IndexSearcher searcher) throws Exception
{
// phrase query build
PhraseQuery.Builder builder = new PhraseQuery.Builder();
String[] words = source.split(" ");
int ii = 0;
for (String word : words) {
builder.add(new Term("source", word), ii);
ii = ii + 1;
}
PhraseQuery query = builder.build();
System.out.println(query);
// phrase search
TopDocs hits = searcher.search(query, 10);
return hits;
}
public static void main(String[] args) throws Exception
{
// TODO Auto-generated method stub
// create index writer
IndexWriter writer = createWriter();
//create documents object
List<Document> documents = new ArrayList<>();
String src = "Negotiation Skills are focused on resolving differences for the benefit of an individual or a group , or to satisfy various interests.";
String tgt = "Modified target : Negotiation Skills are focused on resolving differences for the benefit of an individual or a group, or to satisfy various interests.";
Document d1 = createDocument(1, src, tgt);
documents.add(d1);
src = "This point may benefit all of the participating entities, or just a single party, some of them, or all of them.";
tgt = "Modified target : This point may benefit all of the participating entities, or just a single party, some of them, or all of them.";
Document d2 = createDocument(2, src, tgt);
documents.add(d2);
writer.deleteAll();
// adding documents to index writer
writer.addDocuments(documents);
writer.commit();
writer.close();
// for index searching
Directory dir = FSDirectory.open(Paths.get(INDEX_DIR));
IndexReader reader = DirectoryReader.open(dir);
IndexSearcher searcher = new IndexSearcher(reader);
//Search by source
TopDocs foundDocs = searchBySource("benefit of an individual", searcher);
System.out.println("Total Results count :: " + foundDocs.totalHits);
}
}
When I search for the string "benefit of an individual" as mentioned above, the total results count comes back as 0, but the phrase is present in document 1. It would be great if I could get any help in resolving this issue.
Thanks in advance.
Let's start with a summary:
at index time you are using the Standard analyzer with English stop words
at query time you are using your own analysis without stop-word or special-character removal
There is a rule: use the same analysis chain at index and query time.
Here is an example of simplified and "correct" query processing:
// function to do index search by source
private static TopDocs searchBySource(String source, IndexSearcher searcher) throws Exception {
// phrase query build
PhraseQuery.Builder builder = new PhraseQuery.Builder();
TokenStream tokenStream = new StandardAnalyzer().tokenStream("source", source);
tokenStream.reset();
while (tokenStream.incrementToken()) {
CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
builder.add(new Term("source", charTermAttribute.toString()));
}
tokenStream.end();
tokenStream.close();
builder.setSlop(2);
PhraseQuery query = builder.build();
System.out.println(query);
// phrase search
TopDocs hits = searcher.search(query, 10);
return hits;
}
For the sake of simplicity, we can remove stop words from the Standard analyzer by using the constructor with an empty stop-word list (see the one-liner below), and everything will be as simple as you expected. You can read more about stop words and phrase queries here.
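For example, with the 7.x API used in the question, a stop-word-free StandardAnalyzer can be built with an empty stop-word set; it must then be used both for indexing and for query analysis:
// an empty stop-word set, so "of", "an", "the", etc. are kept and term positions stay contiguous
Analyzer analyzer = new StandardAnalyzer(CharArraySet.EMPTY_SET);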
All the problems with phrase queries start with stop words. Under the hood, Lucene keeps the positions of all words, including stop words, in a special index -
term positions. This is useful in some cases, to distinguish "the goal" from "goal". A phrase query tries to take term positions into account. For example, say we have the text "black and white", where "and" is a stop word. In this case the Lucene index will have two terms: "black" at position 1 and "white" at position 3. The naive phrase query "black white" will not match anything because it does not allow a gap in term positions. There are two possible strategies to create the right query:
"black ? white" - uses a special marker for every stop word. This will match "black and white" and "black or white".
"black white"~1 - allows matching with a gap in term positions. "black or white" is also possible; with a slop of 2 or more, "white and black" is also possible.
In order to create the right query, you can use the following term attribute during query processing:
PositionIncrementAttribute positionIncrementAttribute = tokenStream.getAttribute(PositionIncrementAttribute.class);
I've used setSlop(2) to simplify the code snippet; you can set the slop factor based on the query length, or put the correct positions of the terms into the phrase builder. My recommendation is not to use stop words at all; you can read more about stop words here.
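A sketch of my own (not from the answer above) of building the phrase query with correct positions instead of a slop, using that attribute with the same analyzer and "source" field as before:
// analyze the query text with the index-time analyzer and keep the original token positions
PhraseQuery.Builder builder = new PhraseQuery.Builder();
try (TokenStream ts = new StandardAnalyzer().tokenStream("source", source)) {
    CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
    PositionIncrementAttribute posAtt = ts.addAttribute(PositionIncrementAttribute.class);
    ts.reset();
    int position = -1;
    while (ts.incrementToken()) {
        position += posAtt.getPositionIncrement(); // increments > 1 mark removed stop words
        builder.add(new Term("source", termAtt.toString()), position);
    }
    ts.end();
}
PhraseQuery query = builder.build();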

Getting the term frequencies in the order the documents were indexed

I have a collection of documents (say 10 documents) and I'm indexing them this way, storing the term vector:
StringReader strRdElt = new StringReader(content);
Document doc = new Document();
String docname=docNames[docNo];
doc.add(new Field("doccontent", strRdElt, Field.TermVector.YES));
IndexWriter iW;
try {
NIOFSDirectory dir = new NIOFSDirectory(new File(pathToIndex)) ;
iW = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_35,
new StandardAnalyzer(Version.LUCENE_35)));
iW.addDocument(doc);
iW.close();
} catch (IOException e) {
e.printStackTrace();
}
After indexing all the documents, I'm getting the term frequencies of each document this way:
IndexReader re = IndexReader.open(FSDirectory.open(new File(pathToIndex)), true) ;
TermFreqVector[] termsFreq = new TermFreqVector[noOfDocs]; // one entry per document
for(int i=0;i<noOfDocs;i++){
termsFreq[i] = re.getTermFreqVector(i, "doccontent");
}
My problem is that I'm not getting the term frequency vectors in the corresponding order. Say, for the 2nd document that I indexed, I'm getting its term frequencies and terms at "termsFreq[9]".
What is the reason for that? How can I get the term frequencies in the same order in which I indexed the documents?

Lucene: Multi-word phrases as search terms

I'm trying to make a searchable phone/local business directory using Apache Lucene.
I have fields for street name, business name, phone number, etc. The problem I'm having is that when I try to search by street where the street name has multiple words (e.g. 'the crescent'), no results are returned. But if I search with just one word, e.g. 'crescent', I get all the results that I want.
I'm indexing the data with the following:
String LocationOfDirectory = "C:\\dir\\index";
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);
Directory index = new SimpleFSDirectory(new File(LocationOfDirectory));
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_34, analyzer);
IndexWriter w = new IndexWriter(index, config);
Document doc = new Document();
doc.add(new Field("Street", "the crescent", Field.Store.YES, Field.Index.ANALYZED));
w.addDocument(doc);
w.close();
My searches work like this:
int numberOfHits = 200;
String LocationOfDirectory = "C:\\dir\\index";
TopScoreDocCollector collector = TopScoreDocCollector.create(numberOfHits, true);
Directory directory = new SimpleFSDirectory(new File(LocationOfDirectory));
IndexSearcher searcher = new IndexSearcher(IndexReader.open(directory));
WildcardQuery q = new WildcardQuery(new Term("Street", "the crescent"));
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
I have tried swapping the wildcard query for a phrase query, first with the entire string and then splitting the string up on white space and wrapping them in a BooleanQuery like this:
String term = "the crescent";
BooleanQuery b = new BooleanQuery();
PhraseQuery p = new PhraseQuery();
String[] tokens = term.split(" ");
for(int i = 0 ; i < tokens.length ; ++i)
{
p.add(new Term("Street", tokens[i]));
}
b.add(p, BooleanClause.Occur.MUST);
However, this didn't work. I tried using a KeywordAnalyzer instead of a StandardAnalyzer, but then all other types of search stopped working as well. I have tried replacing spaces with other characters (+ and #), and converting queries to and from this form, but that still doesn't work. I think it doesn't work because + and # are special characters which are not indexed, but I can't seem to find a list anywhere of which characters are like that.
I'm beginning to go slightly mad, does anyone know what I'm doing wrong?
The reason you don't get your documents back is that while indexing you're using StandardAnalyzer, which converts tokens to lowercase and removes stop words. So the only term that gets indexed for your example is 'crescent'. However, wildcard queries are not analyzed, so 'the' is included as a mandatory part of the query. The same goes for phrase queries in your scenario.
KeywordAnalyzer is probably not very suitable for your use case, because it takes whole field content as a single token. You can use SimpleAnalyzer for the street field -- it will split the input on all non-letter characters and then convert them to lowercase. You can also consider using WhitespaceAnalyzer with LowerCaseFilter. You need to try different options and work out what works best for your data and users.
Also, you can use different analyzers per field (e.g. with PerFieldAnalyzerWrapper) if changing analyzer for that field breaks other searches.
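For example, a sketch of per-field analysis with the Lucene 3.x API used here (assuming SimpleAnalyzer is acceptable for street names; adjust as needed):
// StandardAnalyzer stays the default; only the "Street" field gets a stop-word-free analyzer
PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_34));
analyzer.addAnalyzer("Street", new SimpleAnalyzer(Version.LUCENE_34));
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_34, analyzer);
The same wrapped analyzer would then be passed to the QueryParser at search time.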
I found that my attempt to generate a query without using a QueryParser was not working, so I stopped trying to create my own queries and used a QueryParser instead. All of the recommendations I saw online said that you should use the same Analyzer in the QueryParser that you use during indexing, so I used a StandardAnalyzer to build the QueryParser.
This works on this example because the StandardAnalyzer removes the word "the" from the street "the crescent" during indexing, and hence we can't search for it because it isn't in the index.
However, if we choose to search for "Grove Road", we have a problem with the out-of-the-box functionality, namely that the query will return all of the results containing either "Grove" OR "Road". This is easily fixed by setting up the QueryParser so that its default operator is AND instead of OR.
In the end, the correct solution was the following:
int numberOfHits = 200;
String LocationOfDirectory = "C:\\dir\\index";
TopScoreDocCollector collector = TopScoreDocCollector.create(numberOfHits, true);
Directory directory = new SimpleFSDirectory(new File(LocationOfDirectory));
IndexSearcher searcher = new IndexSearcher(IndexReader.open(directory));
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
//WildcardQuery q = new WildcardQuery(new Term("Street", "the crescent"));
QueryParser qp = new QueryParser(Version.LUCENE_35, "Street", analyzer);
qp.setDefaultOperator(QueryParser.Operator.AND);
Query q = qp.parse("grove road");
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
@RikSaunderson's solution of searching for documents where all subqueries of a query have to occur still works with Lucene 9.
QueryParser queryParser = new QueryParser(LuceneConstants.CONTENTS, new StandardAnalyzer());
queryParser.setDefaultOperator(QueryParser.Operator.AND);
If you want an exact-words match on the street, you could set the "Street" field to NOT_ANALYZED, which will not filter out the stop word "the".
doc.add(new Field("Street", "the crescent", Field.Store.YES, Field.Index.NOT_ANALYZED));
There is no need to use any Analyzer here, because Hibernate Search implicitly uses the StandardAnalyzer, which splits words on whitespace. The solution here is to set Analyze to NO; it will then automatically perform the multi-phrase search.
@Column(name="skill")
@Field(index=Index.YES, analyze=Analyze.NO, store=Store.NO)
@Analyzer(definition="SkillsAnalyzer")
private String skill;

How do I index and search text files in Lucene 3.0.2?

I am a newbie in Lucene, and I'm having some problems creating simple code to query a text file collection.
I tried this example, but it is incompatible with the new version of Lucene.
UPDATE: This is my new code, but it still doesn't work yet.
Lucene is quite a big topic with a lot of classes and methods to cover, and you normally cannot use it without understanding at least some basic concepts. If you need a quickly available service, use Solr instead. If you need full control of Lucene, read on. I will cover some core Lucene concepts and the classes that represent them. (For information on how to read text files into memory, see, for example, this article.)
Whatever you are going to do in Lucene - indexing or searching - you need an analyzer. The goal of an analyzer is to tokenize (break into words) and stem (get the base of a word) your input text. It also throws out the most frequent words like "a", "the", etc. You can find analyzers for more than 20 languages, or you can use SnowballAnalyzer and pass the language as a parameter.
To create an instance of SnowballAnalyzer for English, use this:
Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");
If you are going to index texts in different languages and want to select the analyzer automatically, you can use Tika's LanguageIdentifier.
You need to store your index somewhere. There are two major possibilities: an in-memory index, which is easy to try, and a disk index, which is the most widespread one.
Use either of the next two lines:
Directory directory = new RAMDirectory(); // RAM index storage
Directory directory = FSDirectory.open(new File("/path/to/index")); // disk index storage
When you want to add, update or delete document, you need IndexWriter:
IndexWriter writer = new IndexWriter(directory, analyzer, true, new IndexWriter.MaxFieldLength(25000));
Any document (text file in your case) is a set of fields. To create document, which will hold information about your file, use this:
Document doc = new Document();
String title = nameOfYourFile;
doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED)); // adding title field
String content = contentsOfYourFile;
doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED)); // adding content field
writer.addDocument(doc); // writing new document to the index
The Field constructor takes the field's name, its text, and at least two more parameters. The first is a flag that shows whether Lucene must store this field. If it equals Field.Store.YES, you will be able to get all your text back from the index; otherwise only index information about it will be stored.
The second parameter shows whether Lucene must index this field or not. Use Field.Index.ANALYZED for any field you are going to search on.
Normally, you use both parameters as shown above.
Don't forget to close your IndexWriter after the job is done:
writer.close();
Searching is a bit trickier. You will need several classes: Query and QueryParser to build a Lucene query from a string, IndexSearcher for the actual searching, TopScoreDocCollector to store the results (it is passed to IndexSearcher as a parameter), and ScoreDoc to iterate through the results. The next snippet shows how this all fits together:
IndexSearcher searcher = new IndexSearcher(directory);
QueryParser parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
Query query = parser.parse("terms to search");
TopScoreDocCollector collector = TopScoreDocCollector.create(HOW_MANY_RESULTS_TO_COLLECT, true);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// `i` is just a number of document in Lucene. Note, that this number may change after document deletion
for (int i = 0; i < hits.length; i++) {
Document hitDoc = searcher.doc(hits[i].doc); // getting actual document
System.out.println("Title: " + hitDoc.get("title"));
System.out.println("Content: " + hitDoc.get("content"));
System.out.println();
}
Note the second argument to the QueryParser constructor - it is the default field, i.e. the field that will be searched if no qualifier is given. For example, if your query is "title:term", Lucene will search for the word "term" in the "title" field of all docs, but if your query is just "term", it will search in the default field, in this case "content". For more info see Lucene Query Syntax.
QueryParser also takes an analyzer as its last argument. This must be the same analyzer you used to index your text.
The last thing you must know is the first parameter of TopScoreDocCollector.create. It is just a number that represents how many results you want to collect. For example, if it equals 100, Lucene will collect only the first (by score) 100 results and drop the rest. This is just an optimization - you collect the best results, and if you're not satisfied with them, you repeat the search with a larger number.
Finally, don't forget to close the searcher and directory so you don't leak system resources:
searcher.close();
directory.close();
EDIT: Also see the IndexFiles demo class from the Lucene 3.0 sources.
package org.test;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.queryParser.*;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
public class LuceneSimple {
private static void addDoc(IndexWriter w, String value) throws IOException {
Document doc = new Document();
doc.add(new Field("title", value, Field.Store.YES, Field.Index.ANALYZED));
w.addDocument(doc);
}
public static void main(String[] args) throws CorruptIndexException, LockObtainFailedException, IOException, ParseException {
File dir = new File("F:/tmp/dir");
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
Directory index = new RAMDirectory();
//Directory index = FSDirectory.open(new File("lucDirHello") );
IndexWriter w = new IndexWriter(index, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
w.setRAMBufferSizeMB(200);
System.out.println(index.getClass() + " RamBuff:" + w.getRAMBufferSizeMB() );
addDoc(w, "Lucene in Action");
addDoc(w, "Lucene for Dummies");
addDoc(w, "Managing Gigabytes");
addDoc(w, "The Art of Computer Science");
addDoc(w, "Computer Science ! what is that ?");
Long N = 0l;
for( File f : dir.listFiles() ){
BufferedReader br = new BufferedReader( new FileReader(f) );
String line = null;
while( ( line = br.readLine() ) != null ){
if( line.length() < 140 ) continue;
addDoc(w, line);
++N;
}
br.close();
}
w.close();
// 2. query
String querystr = "Computer";
Query q = new QueryParser( Version.LUCENE_30, "title", analyzer ).parse(querystr);
//search
int hitsPerPage = 10;
IndexSearcher searcher = new IndexSearcher(index, true);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
System.out.println("Found " + hits.length + " hits.");
for(int i=0;i<hits.length;++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println((i + 1) + ". " + d.get("title"));
}
searcher.close();
}
}
I suggest you look into Solr at http://lucene.apache.org/solr/ rather than working with the Lucene API directly.
