I'm using Lucene's features to build a simple way to match similar words within a text.
My idea is to have have an Analyzer running on my text to provide a TokenStream, and for each token I run a FuzzyQuery to see if I have a match in my index. If not I just index a new Document containing just the new unique word.
Here's what I'm getting tho:
Exception in thread "main" java.lang.IllegalStateException: TokenStream contract violation: close() call missing
at org.apache.lucene.analysis.Tokenizer.setReader(Tokenizer.java:90)
at org.apache.lucene.analysis.Analyzer$TokenStreamComponents.setReader(Analyzer.java:411)
at org.apache.lucene.analysis.standard.StandardAnalyzer$1.setReader(StandardAnalyzer.java:111)
at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:165)
at org.apache.lucene.document.Field.tokenStream(Field.java:568)
at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:708)
at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:417)
at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:373)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:231)
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:478)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1562)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1307)
at org.myPackage.MyClass.addToIndex(MyClass.java:58)
Relevant code here:
// Setup tokenStream based on StandardAnalyzer
TokenStream tokenStream = analyzer.tokenStream(TEXT_FIELD_NAME, new StringReader(input));
tokenStream = new StopFilter(tokenStream, EnglishAnalyzer.getDefaultStopSet());
tokenStream = new ShingleFilter(tokenStream, 3);
tokenStream.addAttribute(CharTermAttribute.class);
tokenStream.reset();
...
// Iterate and process each token from the stream
while (tokenStream.incrementToken()) {
CharTermAttribute charTerm = tokenStream.getAttribute(CharTermAttribute.class);
processWord(charTerm.toString());
}
...
// Processing a word means looking for a similar one inside the index and, if not found, adding this one to the index
void processWord(String word) {
...
if (DirectoryReader.indexExists(index)) {
reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopDocs searchResults = searcher.search(query, 1);
if (searchResults.totalHits > 0) {
Document foundDocument = searcher.doc(searchResults.scoreDocs[0].doc);
super.processWord(foundDocument.get(TEXT_FIELD_NAME));
} else {
addToIndex(word);
}
} else {
addToIndex(word);
}
...
}
...
// Create a new Document to index the provided word
void addWordToIndex(String word) throws IOException {
Document newDocument = new Document();
newDocument.add(new TextField(TEXT_FIELD_NAME, new StringReader(word)));
indexWriter.addDocument(newDocument);
indexWriter.commit();
}
The exception seems to tell that I should close the TokenStream before adding things to the index, but this doesn't really make sense to me because how are index and TokenStream related? I mean, index just receives a Document containing a String, having the String coming from a TokenStream should be irrelevant.
Any hint on how to solve this?
The problem is in your reuse of the same analyzer that the IndexWriter is trying to use. You have a TokenStream open from that analyzer, and then you try to index a document. That document needs to be analyzed, but the analyzer finds it's old TokenStream is still open, and throws an exception.
To fix it, you could create a new, separate analyzer for processing and testing the string, instead of using the one that IndexWriter is using.
Related
I m trying to create a custom analyzer in Lucene 8.3.0 that uses stemming and filters the given text using custom stop words from a file.
To be more clear, I don't want to use the default stop words filter and add some words on it, I want to filter using only a set of stop words from a stopWords.txt file.
How can I do this?
This is what I have written until now, but I am not sure if it is right
public class MyAnalyzer extends Analyzer{
//public class MyAnalyzer extends Analyzer {
#Override
protected TokenStreamComponents createComponents(String fieldName) {
// public TokenStream tokenStream(String fieldName, Reader reader) {
Tokenizer tokenizer = new StandardTokenizer();
TokenStream tokenStream = new StandardFilter(tokenizer);
tokenStream = new LowerCaseFilter(tokenStream);
tokenStream = new StopFilter(tokenStream,StopAnalyzer.ENGLISH_STOP_WORDS_SET);
//Adding Porter Stemming filtering
tokenStream = new PorterStemFilter(tokenStream);
//return tokenStream;
return new TokenStreamComponents(tokenizer, tokenStream);
}
}
First of all I am not sure if the structure is correct and for now I am using the StopFilter from StopAnalyzer just to test it (however it's not working).
You need to read the file and parse it to a CharArraySet to pass into the filter. StopFilter has some built in methods you can use to convert a List of Strings to a CharArraySet, like:
...
CharArraySet stopset = StopFilter.makeStopSet(myStopwordList);
tokenStream = new StopFilter(tokenStream, stopset);
...
It's listed as for internal purposes, so fair warning about relying on this class, but if you don't want to have to handle parsing your file to a list, you could use WordListLoader to parse your stopword file into a CharArraySet, something like:
...
CharArraySet stopset = WordlistLoader.getWordSet(myStopfileReader);
tokenStream = new StopFilter(tokenStream, stopset);
...
I'm not getting the expected results from my Analyzer and would like to test the tokenization process.
The answer this question: How to use a Lucene Analyzer to tokenize a String?
List<String> result = new ArrayList<String>();
TokenStream stream = analyzer.tokenStream(field, new StringReader(keywords));
try {
while(stream.incrementToken()) {
result.add(stream.getAttribute(TermAttribute.class).term());
}
}
catch(IOException e) {
// not thrown b/c we're using a string reader...
}
return result;
Uses the TermAttribute to extract the tokens from the stream. The problem is that TermAttribute is no longer in Lucene 6.
What has it been replaced by?
What would the equivalent be with Lucene 6.6.0?
I'm pretty sure it was replaced by CharTermAttribute javadoc
The ticket is pretty old, but maybe the code was kept around a bit longer:
https://issues.apache.org/jira/browse/LUCENE-2372
I'm using Lucene 5.1.0. After Analyzing and indexing a document, I would like to get a list of all the terms indexed that belong to this specific document.
{
File[] files = FILES_TO_INDEX_DIRECTORY.listFiles();
for (File file : files) {
Document document = new Document();
Reader reader = new FileReader(file);
document.add(new TextField("fieldname",reader));
iwriter.addDocument(document);
}
iwriter.close();
IndexReader indexReader = DirectoryReader.open(directory);
int maxDoc=indexReader.maxDoc();
for (int i=0; i < maxDoc; i++) {
Document doc=indexReader.document(i);
String[] terms = doc.getValues("fieldname");
}
}
the terms return null. Is there a way to get the saved terms per document?
Here is a sample code for the answer, using a TokenStream
TokenStream ts= analyzer.tokenStream("myfield", reader);
// The Analyzer class will construct the Tokenizer, TokenFilter(s), and CharFilter(s),
// and pass the resulting Reader to the Tokenizer.
OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
CharTermAttribute charTermAttribute = ts.addAttribute(CharTermAttribute.class);
try {
ts.reset(); // Resets this stream to the beginning. (Required)
while (ts.incrementToken()) {
// Use AttributeSource.reflectAsString(boolean)
// for token stream debugging.
System.out.println("token: " + ts.reflectAsString(true));
String term = charTermAttribute.toString();
System.out.println(term);
}
ts.end(); // Perform end-of-stream operations, e.g. set the final offset.
} finally {
ts.close(); // Release resources associated with this stream.
}
It seems that whenever I update an existing document in the index (same behavior for delete / add), it can't be found with a TermQuery. Here's a short snippet:
iw = new IndexWriter(directory, config);
Document doc = new Document();
doc.add(new StringField("string", "a", Store.YES));
doc.add(new IntField("int", 1, Store.YES));
iw.addDocument(doc);
Query query = new TermQuery(new Term("string","a"));
Document[] hits = search(query);
doc = hits[0];
print(doc);
doc.removeField("int");
doc.add(new IntField("int", 2, Store.YES));
iw.updateDocument(new Term("string","a"), doc);
hits = search(query);
System.out.println(hits.length);
System.out.println("_________________");
for(Document hit : search(new MatchAllDocsQuery())){
print(hit);
}
This produces the following console output:
stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<string:a>
stored<int:1>
________________
0
_________________
stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<string:a>
stored<int:2>
________________
It seems that after the update, the document (rather the new document) in the index and gets returned by the MatchAllDocsQuery, but can't be found by a TermQuery.
Full sample code available at http://pastebin.com/sP2Vav9v
Also, this only happens (second search not working) when the StringField value contains special characters (e.g. file:/F:/).
The code which you have referenced in pastebin doesn't find anything because your StringField is nothing but a stopword (a). Replacing a with something which is not a stopword (e.g. ax) makes both searches to return 1 doc.
You would also achieve the correct result if you were to construct StandardAnalyzer with empty stopword set (CharArraySet.EMPTY_SET) yet still using a for StringField. This wouldn't work for file:/F:/ though.
However, the best solution is this case would be to replace StandardAnalyzer with KeywordAnalyzer.
I could get rid of this by recreating my working directory after all indexing operations :
create a new directory just for this indexing operations named "path_dir" for example. If you have updated then call the following operations and do all of your previous works again.
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
FSDirectory dir;
try {
// delete indexing files :
dir = FSDirectory.open(new File(path_dir));
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, analyzer);
IndexWriter writer = new IndexWriter(dir, config);
writer.deleteAll();
writer.close();
} catch (IOException e) {
e.printStackTrace();
}
However, note that this way will be very slow if you are handling big data.
I want to do a search for a query within a file "fdictionary.txt" containing a list of words (230,000 words) written line by line. any suggestion why this code is not working?
The spell checking part is working and gives me the list of suggestions (I limited the length of the list to 1). what I want to do is to search that fdictionary and if the word is already in there, do not call spell checking. My Search function is not working. It does not give me error! Here is what I have implemented:
public class SpellCorrection {
public static File indexDir = new File("/../idxDir");
public static void main(String[] args) throws IOException, FileNotFoundException, CorruptIndexException, ParseException {
Directory directory = FSDirectory.open(indexDir);
SpellChecker spell = new SpellChecker(directory);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_20, null);
File dictionary = new File("/../fdictionary00.txt");
spell.indexDictionary(new PlainTextDictionary(dictionary), config, true);
String query = "red"; //kne, console
String correctedQuery = query; //kne, console
if (!search(directory, query)) {
String[] suggestions = spell.suggestSimilar(query, 1);
if (suggestions != null) {correctedQuery=suggestions[0];}
}
System.out.println("The Query was: "+query);
System.out.println("The Corrected Query is: "+correctedQuery);
}
public static boolean search(Directory directory, String queryTerm) throws FileNotFoundException, CorruptIndexException, IOException, ParseException {
boolean isIn = false;
IndexReader indexReader = IndexReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_20);
Term term = new Term(queryTerm);
Query termQuery = new TermQuery(term);
TopDocs hits = indexSearcher.search(termQuery, 100);
System.out.println(hits.totalHits);
if (hits.totalHits > 0) {
isIn = true;
}
return isIn;
}
}
where are you indexing the content from fdictionary00.txt?
You can search using IndexSearcher, only when you have index. If you are new to lucene, you might want to check some quick tutorials. (like http://lucenetutorial.com/lucene-in-5-minutes.html)
You never built the index.
You need to setup the index...
Directory directory = FSDirectory.open(indexDir);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_20);
IndexWriter writer = new IndexWriter(directory,analyzer,true,IndexWriter.MaxFieldLength.UNLIMITED );
You then need to create a document and add each term to the document as an analyzed field..
Document doc = new Document();
doc.Add(new Field("name", word , Field.Store.YES, Field.Index.ANALYZED));
Then add the document to the index
writer.AddDocument(doc);
writer.Optimize();
Now build the index and close the index writer.
writer.Commit();
writer.Close();
You could make your SpellChecker instance available in a service and use spellChecker.exist(word).
Be aware that the SpellChecker will not index words 2 characters or less. To get around this you can add them to the index after you have created it (add them into SpellChecker.F_WORD field).
If you want to add to your live index and make them available for exist(word) then you will need to add them to the SpellChecker.F_WORD field. Of course, because you're not adding to all the other fields such as gram/start/end etc then your word will not appear as a suggestion for other misspelled words.
In this case you'd have had to add the word into your file so when you re-create the index it would then be available as a suggestion. It would be great if the project made SpellChecker.createDocument(...) public/protected, rather than private, as this method accomplishes everything with adding words.
After all this your need to call spellChecker.setSpellIndex(directory).