I want to search for a query within a file "fdictionary.txt" that contains a list of words (230,000 of them), one per line. Any suggestion why this code is not working?
The spell-checking part works and gives me the list of suggestions (I limited the length of the list to 1). What I want to do is search that fdictionary first and, if the word is already in there, not call the spell checker at all. My search function is not working, and it does not give me an error! Here is what I have implemented:
public class SpellCorrection {

    public static File indexDir = new File("/../idxDir");

    public static void main(String[] args) throws IOException, FileNotFoundException, CorruptIndexException, ParseException {
        Directory directory = FSDirectory.open(indexDir);
        SpellChecker spell = new SpellChecker(directory);
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_20, null);
        File dictionary = new File("/../fdictionary00.txt");
        spell.indexDictionary(new PlainTextDictionary(dictionary), config, true);

        String query = "red"; //kne, console
        String correctedQuery = query; //kne, console
        if (!search(directory, query)) {
            String[] suggestions = spell.suggestSimilar(query, 1);
            if (suggestions != null) {
                correctedQuery = suggestions[0];
            }
        }
        System.out.println("The Query was: " + query);
        System.out.println("The Corrected Query is: " + correctedQuery);
    }

    public static boolean search(Directory directory, String queryTerm) throws FileNotFoundException, CorruptIndexException, IOException, ParseException {
        boolean isIn = false;
        IndexReader indexReader = IndexReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_20);
        Term term = new Term(queryTerm);
        Query termQuery = new TermQuery(term);
        TopDocs hits = indexSearcher.search(termQuery, 100);
        System.out.println(hits.totalHits);
        if (hits.totalHits > 0) {
            isIn = true;
        }
        return isIn;
    }
}
Where are you indexing the content from fdictionary00.txt?
You can search using IndexSearcher only when you have an index. If you are new to Lucene, you might want to check some quick tutorials (like http://lucenetutorial.com/lucene-in-5-minutes.html).
You never built the index.
You need to set up the index first...
Directory directory = FSDirectory.open(indexDir);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_20);
IndexWriter writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
You then need to create a document and add each term to it as an analyzed field:
Document doc = new Document();
doc.add(new Field("name", word, Field.Store.YES, Field.Index.ANALYZED));
Then add the document to the index
writer.addDocument(doc);
writer.optimize();
Now build the index and close the index writer.
writer.commit();
writer.close();
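Putting those pieces together, here is a minimal sketch of indexing the dictionary file, assuming the same Lucene 3.x API and Version constant used in the snippets above (the class name, the field name "name" and the paths are only illustrative):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class DictionaryIndexer {
    public static void main(String[] args) throws Exception {
        Directory directory = FSDirectory.open(new File("/../idxDir"));
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_20);
        IndexWriter writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);

        // one document per dictionary word, stored and analyzed so it can be searched later
        BufferedReader reader = new BufferedReader(new FileReader(new File("/../fdictionary.txt")));
        String word;
        while ((word = reader.readLine()) != null) {
            Document doc = new Document();
            doc.add(new Field("name", word.trim(), Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);
        }
        reader.close();

        writer.optimize();
        writer.commit();
        writer.close();
    }
}

On the search side you would then look the word up in that same field, e.g. with new TermQuery(new Term("name", queryTerm)), rather than the single-argument Term constructor used in the question.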
You could make your SpellChecker instance available in a service and use spellChecker.exist(word).
Be aware that the SpellChecker will not index words of 2 characters or less. To get around this you can add them to the index after you have created it (add them to the SpellChecker.F_WORD field).
If you want to add words to your live index and make them available for exist(word), then you will need to add them to the SpellChecker.F_WORD field. Of course, because you are not populating all the other fields such as gram/start/end, your word will not appear as a suggestion for other misspelled words.
In that case you would have to add the word to your file, so that when you re-create the index it becomes available as a suggestion. It would be great if the project made SpellChecker.createDocument(...) public/protected rather than private, as this method accomplishes everything involved in adding words.
After all this you need to call spellChecker.setSpellIndex(directory).
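A minimal sketch of that idea, assuming spell is the SpellChecker already built over your dictionary index (the helper class and method names are made up for illustration):

import java.io.IOException;
import org.apache.lucene.search.spell.SpellChecker;

public final class SpellCheckHelper {
    // Returns the query unchanged when the dictionary already contains it,
    // otherwise the best suggestion (or the query again if there is none).
    public static String correct(SpellChecker spell, String query) throws IOException {
        if (spell.exist(query)) {
            return query;
        }
        String[] suggestions = spell.suggestSimilar(query, 1);
        return (suggestions != null && suggestions.length > 0) ? suggestions[0] : query;
    }
}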
Related
I need to use Lucene MoreLikeThis to find similar documents given a paragraph of text. I am new to Lucene and followed the code here
I have already indexed the documents at the directory - "C:\Users\lucene_index_files\v2"
I am using "They are computer engineers and they like to develop their own tools. The program in languages like Java, CPP." as the document to which I want to find similar documents.
public class LuceneSearcher2 {

    public static void main(String[] args) throws IOException {
        LuceneSearcher2 m = new LuceneSearcher2();
        System.out.println("1");
        m.start();
        System.out.println("2");
        //m.writerEntries();
        m.findSilimar("They are computer engineers and they like to develop their own tools. The program in languages like Java, CPP.");
        System.out.println("3");
    }

    private Directory indexDir;
    private StandardAnalyzer analyzer;
    private IndexWriterConfig config;

    public void start() throws IOException {
        //analyzer = new StandardAnalyzer(Version.LUCENE_42);
        //config = new IndexWriterConfig(Version.LUCENE_42, analyzer);
        analyzer = new StandardAnalyzer();
        config = new IndexWriterConfig(analyzer);
        config.setOpenMode(OpenMode.CREATE_OR_APPEND);
        indexDir = new RAMDirectory(); //don't write on disk
        //https://stackoverflow.com/questions/36542551/lucene-in-java-method-not-found?rq=1
        indexDir = FSDirectory.open(FileSystems.getDefault().getPath("C:\\Users\\lucene_index_files\\v2")); //write on disk
        //System.out.println(indexDir);
    }

    private void findSilimar(String searchForSimilar) throws IOException {
        IndexReader reader = DirectoryReader.open(indexDir);
        IndexSearcher indexSearcher = new IndexSearcher(reader);
        System.out.println("2a");

        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setMinTermFreq(0);
        mlt.setMinDocFreq(0);
        mlt.setFieldNames(new String[]{"title", "content"});
        mlt.setAnalyzer(analyzer);
        System.out.println("2b");

        StringReader sReader = new StringReader(searchForSimilar);
        //Query query = mlt.like(sReader, null);
        //Throws error - The method like(String, Reader...) in the type MoreLikeThis is not applicable for the arguments (StringReader, null)
        Query query = mlt.like("computer");
        System.out.println("2c");
        System.out.println(query.toString());

        TopDocs topDocs = indexSearcher.search(query, 10);
        for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
            Document aSimilar = indexSearcher.doc(scoreDoc.doc);
            String similarTitle = aSimilar.get("title");
            String similarContent = aSimilar.get("content");
            System.out.println("====similar finded====");
            System.out.println("title: " + similarTitle);
            System.out.println("content: " + similarContent);
        }
        System.out.println("2d");
    }
}
I am unsure as to what is causing the system to not generate any output.
What is your output? I am assuming you're not finding similar documents. The reason could be that the query you are creating is empty.
First of all, to run your code in a meaningful way, this line
Query query = mlt.like(sReader, null);
needs a String[] of field names as the second argument, so it should look like this:
Query query = mlt.like(sReader, new String[]{"title", "content"});
Now, in order to use MoreLikeThis in Lucene, your stored fields have to have term vectors enabled via setStoreTermVectors(true) when the fields are created, for instance like this:
FieldType fieldType = new FieldType();
fieldType.setStored(true);
fieldType.setStoreTermVectors(true);
fieldType.setTokenized(true);
Field contentField = new Field("contents", this.getBlurb(), fieldType);
doc.add(contentField);
Leaving this out could result in an empty query and consequently no results for the query.
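To make this concrete, here is a small self-contained sketch. The field name "content", the RAMDirectory and the sample text are assumptions for illustration, and the exact like(...) overload differs between Lucene versions; this sketch follows the like(String field, Reader...) signature reported in the question's compiler error:

import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class MoreLikeThisSketch {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory dir = new RAMDirectory(); // in-memory index, just for the sketch

        // Field type that is indexed, stored and, crucially for MoreLikeThis, keeps term vectors
        FieldType withTermVectors = new FieldType();
        withTermVectors.setStored(true);
        withTermVectors.setTokenized(true);
        withTermVectors.setStoreTermVectors(true);
        withTermVectors.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
        withTermVectors.freeze();

        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
        Document doc = new Document();
        doc.add(new Field("content", "Computer engineers develop their own tools in Java and CPP", withTermVectors));
        writer.addDocument(doc);
        writer.close();

        IndexReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);

        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setAnalyzer(analyzer);
        mlt.setMinTermFreq(0);
        mlt.setMinDocFreq(0);
        mlt.setFieldNames(new String[]{"content"});

        // like() takes the field name plus one or more Readers over the query text
        Query query = mlt.like("content", new StringReader("They are computer engineers and they like to develop their own tools."));
        for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
            System.out.println(searcher.doc(hit.doc).get("content"));
        }
        reader.close();
    }
}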
I'm using Lucene's features to build a simple way to match similar words within a text.
My idea is to have an Analyzer running on my text to provide a TokenStream, and for each token I run a FuzzyQuery to see if I have a match in my index. If not, I just index a new Document containing just the new unique word.
Here's what I'm getting, though:
Exception in thread "main" java.lang.IllegalStateException: TokenStream contract violation: close() call missing
at org.apache.lucene.analysis.Tokenizer.setReader(Tokenizer.java:90)
at org.apache.lucene.analysis.Analyzer$TokenStreamComponents.setReader(Analyzer.java:411)
at org.apache.lucene.analysis.standard.StandardAnalyzer$1.setReader(StandardAnalyzer.java:111)
at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:165)
at org.apache.lucene.document.Field.tokenStream(Field.java:568)
at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:708)
at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:417)
at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:373)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:231)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:478)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1562)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1307)
at org.myPackage.MyClass.addToIndex(MyClass.java:58)
Relevant code here:
// Setup tokenStream based on StandardAnalyzer
TokenStream tokenStream = analyzer.tokenStream(TEXT_FIELD_NAME, new StringReader(input));
tokenStream = new StopFilter(tokenStream, EnglishAnalyzer.getDefaultStopSet());
tokenStream = new ShingleFilter(tokenStream, 3);
tokenStream.addAttribute(CharTermAttribute.class);
tokenStream.reset();

...

// Iterate and process each token from the stream
while (tokenStream.incrementToken()) {
    CharTermAttribute charTerm = tokenStream.getAttribute(CharTermAttribute.class);
    processWord(charTerm.toString());
}

...

// Processing a word means looking for a similar one inside the index and, if not found, adding this one to the index
void processWord(String word) {
    ...
    if (DirectoryReader.indexExists(index)) {
        reader = DirectoryReader.open(index);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs searchResults = searcher.search(query, 1);
        if (searchResults.totalHits > 0) {
            Document foundDocument = searcher.doc(searchResults.scoreDocs[0].doc);
            super.processWord(foundDocument.get(TEXT_FIELD_NAME));
        } else {
            addToIndex(word);
        }
    } else {
        addToIndex(word);
    }
    ...
}

...

// Create a new Document to index the provided word
void addWordToIndex(String word) throws IOException {
    Document newDocument = new Document();
    newDocument.add(new TextField(TEXT_FIELD_NAME, new StringReader(word)));
    indexWriter.addDocument(newDocument);
    indexWriter.commit();
}
The exception seems to say that I should close the TokenStream before adding things to the index, but this doesn't really make sense to me: how are the index and the TokenStream related? I mean, the index just receives a Document containing a String; the fact that the String came from a TokenStream should be irrelevant.
Any hint on how to solve this?
The problem is your reuse of the same analyzer that the IndexWriter is trying to use. You have a TokenStream open from that analyzer, and then you try to index a document. That document needs to be analyzed, but the analyzer finds its old TokenStream is still open and throws an exception.
To fix it, you could create a new, separate analyzer for processing and testing the string, instead of using the one that IndexWriter is using.
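A minimal sketch of that fix (the field name, the RAMDirectory and the plain StandardAnalyzer pipeline are assumptions; the point is only that the stream you iterate and the writer use two separate analyzer instances):

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class SeparateAnalyzers {
    public static void main(String[] args) throws IOException {
        String input = "some text to split into tokens";

        // One analyzer dedicated to walking the input text...
        Analyzer streamingAnalyzer = new StandardAnalyzer();
        // ...and a different instance reserved for the IndexWriter.
        Analyzer indexingAnalyzer = new StandardAnalyzer();

        Directory index = new RAMDirectory();
        IndexWriter writer = new IndexWriter(index, new IndexWriterConfig(indexingAnalyzer));

        TokenStream stream = streamingAnalyzer.tokenStream("text", new StringReader(input));
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            // Indexing here no longer collides with the stream we are iterating,
            // because the writer analyzes the document with its own analyzer.
            Document doc = new Document();
            doc.add(new TextField("text", term.toString(), Field.Store.YES));
            writer.addDocument(doc);
        }
        stream.end();
        stream.close();
        writer.close();
    }
}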
I am trying to find a SIFT implementation for the LIRE library. The only thing I found is the above-linked feature. I am trying to understand what I've got to use in order to extract SIFT features for an image.
Any idea what I've got to do here?
I am trying something like:
Extractor e = new Extractor();
File img = new File("im.jpg");
BufferedImage in = ImageIO.read(img);
BufferedImage newImage = new BufferedImage(in.getWidth(),
in.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
List<Feature> fs1 = e.computeSiftFeatures(newImage);
System.out.println(fs1);
But I've got an empty list.
//Here is the revised answer for you, it may help
public class indexing {
    String directory = "your_image_dataset";
    String index = "./images__idex"; //where you will put the index
    /* if you want to use BOVW based searching you can change the
       numbers below, but be careful */
    int numClusters = 2000; // number of visual words
    int numDocForVocabulary = 200; // number of samples used for visual words vocabulary building

    /* this function calls the document builder and indexer function (indexFiles below)
       for each image in the data set */
    public void IndexImage() throws IOException {
        System.out.println("-< Getting files to index >--------------");
        List<String> images = FileUtils.getAllImages(new File(directory), true);
        System.out.println("-< Indexing " + images.size() + " files >--------------");
        indexFiles(images, index);
    }

    /* this function builds a Lucene document for each image passed to it,
       holding the extracted visual descriptors */
    private void indexFiles(List<String> images, String index)
            throws FileNotFoundException, IOException {
        //first the high level structure
        ChainedDocumentBuilder documentBuilder = new ChainedDocumentBuilder();
        //type of document to be created; here I included different types of visual features
        //documentBuilder.addBuilder(new SurfDocumentBuilder());
        //here choose either SURF or SIFT
        documentBuilder.addBuilder(new SiftDocumentBuilder());
        documentBuilder.addBuilder(DocumentBuilderFactory.getEdgeHistogramBuilder());
        documentBuilder.addBuilder(DocumentBuilderFactory.getJCDDocumentBuilder());
        documentBuilder.addBuilder(DocumentBuilderFactory.getColorLayoutBuilder());
        //IndexWriter creates the file for index storage
        IndexWriter iw = LuceneUtils.createIndexWriter(index, true);
        int count = 0;
        /* then each image in the data set is passed to the created document structure
           (documentBuilder above) and added to the index file */
        for (String identifier : images) {
            Document doc = documentBuilder.createDocument(new FileInputStream(identifier), identifier);
            iw.addDocument(doc); //adding document to index
        }
        iw.close(); // closing the index writer once all images are added
    }
}
/* For searching purposes you will read the index and, by constructing an instance of
   IndexReader, use the different searching strategies available in LIRE. */
My Lucene Java implementation is eating up too many file handles. I followed the instructions in the Lucene wiki about too many open files, but that only helped slow the problem down. Here is my code for adding objects (PTicket) to the index:
//This gets called when the bean is instantiated
public void initializeIndex() {
    analyzer = new WhitespaceAnalyzer(Version.LUCENE_32);
    config = new IndexWriterConfig(Version.LUCENE_32, analyzer);
}

public void addAllToIndex(Collection<PTicket> records) {
    IndexWriter indexWriter = null;
    config = new IndexWriterConfig(Version.LUCENE_32, analyzer);
    try {
        indexWriter = new IndexWriter(directory, config);
        for (PTicket record : records) {
            Document doc = new Document();
            StringBuffer documentText = new StringBuffer();
            doc.add(new Field("_id", record.getIdAsString(), Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("_type", record.getType(), Field.Store.YES, Field.Index.ANALYZED));
            for (String key : record.getProps().keySet()) {
                List<String> vals = record.getProps().get(key);
                for (String val : vals) {
                    addToDocument(doc, key, val);
                    documentText.append(val).append(" ");
                }
            }
            addToDocument(doc, DOC_TEXT, documentText.toString());
            indexWriter.addDocument(doc);
        }
        indexWriter.optimize();
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        cleanup(indexWriter);
    }
}

private void cleanup(IndexWriter iw) {
    if (iw == null) {
        return;
    }
    try {
        iw.close();
    } catch (IOException ioe) {
        logger.error("Error trying to close index writer");
        logger.error("{}", ioe.getClass().getName());
        logger.error("{}", ioe.getMessage());
    }
}

private void addToDocument(Document doc, String field, String value) {
    doc.add(new Field(field, value, Field.Store.YES, Field.Index.ANALYZED));
}
EDIT TO ADD code for searching
public Set<Object> searchIndex(AthenaSearch search) {
    try {
        Query q = new QueryParser(Version.LUCENE_32, DOC_TEXT, analyzer).parse(query);
        //search is actually instantiated in initialization. Lucene recommends this.
        //IndexSearcher searcher = new IndexSearcher(directory, true);
        TopDocs topDocs = searcher.search(q, numResults);
        ScoreDoc[] hits = topDocs.scoreDocs;
        for (int i = start; i < hits.length; ++i) {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            ids.add(d.get("_id"));
        }
        return ids;
    } catch (Exception e) {
        e.printStackTrace();
        return null;
    }
}
This code is in a web application.
1) Is this the advised way to use IndexWriter (instantiating a new one on each add to index)?
2) I've read that raising ulimit will help, but that just seems like a band-aid that won't address the actual problem.
3) Could the problem lie with IndexSearcher?
1) Is this the advised way to use IndexWriter (instantiating a new one on each add to index)?
I advise no; there are constructors which will check whether an index already exists in the directory and will open it or create a new one accordingly. Problem 2 would be solved if you reuse the IndexWriter.
EDIT:
OK, it seems that in Lucene 3.2 all but one of the constructors are deprecated, so reuse of the IndexWriter can be achieved by using the enum IndexWriterConfig.OpenMode with the value CREATE_OR_APPEND.
Also, opening a new writer and closing it on each document add is not efficient; I suggest reusing it. If you want to speed up indexing, raise the RAM buffer with setRAMBufferSizeMB (the default value is 16 MB) and tune it by trial and error.
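A minimal sketch of that reuse, assuming Lucene 3.2 as in the question (the class and method names are made up; directory is the one from the question):

import java.util.Collection;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;

public class SharedWriterIndexer {
    private final IndexWriter indexWriter;

    // Create the writer once (e.g. when the bean is instantiated) and reuse it.
    public SharedWriterIndexer(Directory directory) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_32, new WhitespaceAnalyzer(Version.LUCENE_32));
        config.setOpenMode(OpenMode.CREATE_OR_APPEND); // open the existing index or create it
        config.setRAMBufferSizeMB(32.0);               // default is 16 MB; tune by trial and error
        this.indexWriter = new IndexWriter(directory, config);
    }

    public void addAll(Collection<Document> docs) throws Exception {
        for (Document doc : docs) {
            indexWriter.addDocument(doc);
        }
        indexWriter.commit(); // make the additions visible without closing the writer
    }

    // Called once, at application shutdown.
    public void shutdown() throws Exception {
        indexWriter.close();
    }
}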
from the docs:
Note that you can open an index with create=true even while readers are using the index. The old readers will continue to search the "point in time" snapshot they had opened, and won't see the newly created index until they re-open.
Also reuse the IndexSearcher. I cannot see the code for searching, but IndexSearcher is thread-safe and can be used read-only as well.
I also suggest setting the merge factor on the writer; this is not strictly necessary, but it will help limit the number of inverted index files created. Tune it by trial and error as well.
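As an illustration, again assuming the Lucene 3.2 API (the merge factor lives on a LogMergePolicy there, and the read-only IndexSearcher mirrors the commented-out line in the question):

import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;

public class SearchAndMergeSetup {
    // One IndexSearcher shared across requests; it is thread-safe for read-only use.
    public static IndexSearcher openSharedSearcher(Directory directory) throws Exception {
        return new IndexSearcher(directory, true); // true = read-only
    }

    // Fewer, larger segments: a lower merge factor limits the number of index files.
    public static void tuneMerges(IndexWriterConfig config) {
        LogByteSizeMergePolicy mergePolicy = new LogByteSizeMergePolicy();
        mergePolicy.setMergeFactor(5); // default is 10; tune by trial and error
        config.setMergePolicy(mergePolicy);
    }
}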
I think we'd need to see your search code to be sure, but I'd suspect that it is a problem with the index searcher. More specifically, make sure that your index reader is being properly closed when you've finished with it.
Good luck,
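For example, a sketch of the open-search-close pattern with Lucene 3.x (the "_id" field is the one from the question; how you cache the reader between requests is up to you):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;

public class SearchWithCleanup {
    public static void printIds(Directory directory, Query q, int numResults) throws Exception {
        IndexReader reader = IndexReader.open(directory, true); // read-only reader
        IndexSearcher searcher = new IndexSearcher(reader);
        try {
            for (ScoreDoc hit : searcher.search(q, numResults).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("_id"));
            }
        } finally {
            searcher.close(); // release the searcher...
            reader.close();   // ...and the underlying reader, freeing its file handles
        }
    }
}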
The scientifically correct answer would be: you can't really tell from this fragment of code.
The more constructive answer would be:
You have to make sure that only one IndexWriter is writing to the index at any given time, and you therefore need some mechanism to ensure that. So my answer depends on what you want to accomplish:
do you want a deeper understanding of Lucene? or..
do you just want to build and use an index?
If your answer is the latter, you probably want to look at projects like Solr, which hide all the index reading and writing.
This question is probably a duplicate of
Too many open files Error on Lucene
I am repeating here my answer for that.
Use the compound index format to reduce the file count. When this flag is set, Lucene will write a segment as a single .cfs file instead of multiple files. This will reduce the number of files significantly.
IndexWriter.setUseCompoundFile(true)
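Depending on your Lucene version this setter may live on the merge policy rather than on IndexWriter; with the 3.x IndexWriterConfig API a sketch of the same idea would be:

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;

public class CompoundFileSetup {
    public static IndexWriter openWriter(Directory directory) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_32, new WhitespaceAnalyzer(Version.LUCENE_32));
        LogByteSizeMergePolicy mergePolicy = new LogByteSizeMergePolicy();
        mergePolicy.setUseCompoundFile(true); // segments are written as single .cfs files
        config.setMergePolicy(mergePolicy);
        return new IndexWriter(directory, config);
    }
}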
I am trying to delete a document by using a term in a Lucene index, but the code I wrote below isn't working. Are there any suggestions on how I can perform the delete in a Lucene index?
public class DocumentDelete {
    public static void main(String[] args) {
        File indexDir = new File("C:/Users/Raden/Documents/lucene/LuceneHibernate/adi");
        Term term = new Term(FIELD_PATH, "compatible");
        Directory directory = FSDirectory.getDirectory(indexDir);
        IndexReader indexReader = IndexReader.open(directory);
        indexReader.deleteDocuments(term);
        indexReader.close();
    }
}
IndexReader indexReader = IndexReader.open(directory); // this one uses default readonly mode
instead use this:
IndexReader indexReader = IndexReader.open(directory, false); // this will open the index in edit mode so you can delete documents from it
So you do not need any extra tool for deleting index contents. . .
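Alternatively, here is a small sketch of the same delete done through an IndexWriter, which also removes documents by term (the field name "path" stands in for the question's FIELD_PATH, and the Version constant is an assumption for a Lucene 3.x setup):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class DocumentDeleteWithWriter {
    public static void main(String[] args) throws Exception {
        File indexDir = new File("C:/Users/Raden/Documents/lucene/LuceneHibernate/adi");
        Directory directory = FSDirectory.open(indexDir);

        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
        IndexWriter writer = new IndexWriter(directory, config);

        // Remove every document whose "path" field contains the term "compatible"
        writer.deleteDocuments(new Term("path", "compatible"));
        writer.commit();
        writer.close();
    }
}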