Lucene Incremental Update of Index - java

I'm fairly new to Lucene. I'm working with a huge database that I indexed previously. The problem is that re-indexing the whole table/database every time something new is added is inefficient. I'm using Lucene 3.6.2. I want to write an indexing function that adds the new data to the existing Lucene index files, without needing updateDocument (or delete and re-index in Lucene). That is, it should not create new files to store the new documents but should insert them into the previous index files, without deleting the data already in the index files and without re-indexing the whole database. Indexing should start from the last index location of the previously indexed item, and the new documents should be searchable along with the previously indexed ones. This is my indexer code for creating the index:
public String TestIndex() throws IOException,SQLException
{
System.out.println("preparing dictionary");
String output="";
Long i=0l;
ResultSet rs = null;
URL u = this.getClass().getClassLoader(). getResource(SearchConstant.INDEX_DIRECTORY_DICTIONARYDETAILS);
String dirLoc = u.getPath().replace("%20", " ");
Directory index = FSDirectory.open(new File(dirLoc)); //new RAMDirectory();
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_30,analyzer);
config.setOpenMode(OpenMode.CREATE);
IndexWriter w = new IndexWriter(index, config);
try {
String SQL = "Select * from test";
cm = new DbUtility();
rs = cm.getData(SQL);
// 1. create the index
while (rs.next()) {
Document doc = new Document();
doc.add(new Field("id",rs.getObject(1).toString() , Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("Heading",rs.getObject(2).toString() , Field.Store.YES, Field.Index.ANALYZED));
w.addDocument(doc);
i = i + 1;
}
System.out.println("I " + i.toString());
}
catch (Exception e) {
System.out.println("I in Error " + i.toString());
System.out.println("Error while retrieving data: "+e.getMessage());
}
w.close();
rs.close();
return output;
}
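One possible direction, offered as a sketch rather than a definitive answer: open the writer with OpenMode.CREATE_OR_APPEND instead of OpenMode.CREATE so the existing index is kept, and remember the highest id already indexed so that the next run only selects newer rows. Lucene writes added documents into new segment files and merges them later, so it will not literally insert into the old files, but old and new documents are searched together. The helper below covers only the bookkeeping part in plain Java. The class name IndexCheckpoint, the checkpoint file, and the assumption that the table has a numeric, ever-increasing id column are all hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: remembers the highest row id that has already been indexed. */
final class IndexCheckpoint {

    private IndexCheckpoint() {}

    /** Returns the last indexed id, or 0 if no checkpoint file exists yet. */
    static long load(Path file) {
        if (!Files.exists(file)) {
            return 0L;
        }
        try {
            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
            return text.isEmpty() ? 0L : Long.parseLong(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Overwrites the checkpoint file with the given id. */
    static void save(Path file, long lastId) {
        try {
            Files.write(file, Long.toString(lastId).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** SQL selecting only rows newer than the checkpoint (assumes a numeric, increasing "id" column). */
    static String newRowsSql(String table, long lastId) {
        return "SELECT * FROM " + table + " WHERE id > " + lastId + " ORDER BY id";
    }
}
```

In the indexer above, the loop would keep track of the largest id it reads from the ResultSet and call save only after w.close() has succeeded, so a failed run does not skip rows on the next attempt.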

Related

Unable to identify error in Lucene MoreLikeThis

I need to use Lucene MoreLikeThis to find similar documents given a paragraph of text. I am new to Lucene and followed the code here
I have already indexed the documents at the directory - "C:\Users\lucene_index_files\v2"
I am using "They are computer engineers and they like to develop their own tools. The program in languages like Java, CPP." as the text for which I want to find similar documents.
public class LuceneSearcher2 {
public static void main(String[] args) throws IOException {
LuceneSearcher2 m = new LuceneSearcher2();
System.out.println("1");
m.start();
System.out.println("2");
//m.writerEntries();
m.findSilimar("They are computer engineers and they like to develop their own tools. The program in languages like Java, CPP.");
System.out.println("3");
}
private Directory indexDir;
private StandardAnalyzer analyzer;
private IndexWriterConfig config;
public void start() throws IOException{
//analyzer = new StandardAnalyzer(Version.LUCENE_42);
//config = new IndexWriterConfig(Version.LUCENE_42, analyzer);
analyzer = new StandardAnalyzer();
config = new IndexWriterConfig(analyzer);
config.setOpenMode(OpenMode.CREATE_OR_APPEND);
indexDir = new RAMDirectory(); //don't write on disk
//https://stackoverflow.com/questions/36542551/lucene-in-java-method-not-found?rq=1
indexDir = FSDirectory.open(FileSystems.getDefault().getPath("C:\\Users\\lucene_index_files\\v2")); //write on disk
//System.out.println(indexDir);
}
private void findSilimar(String searchForSimilar) throws IOException {
IndexReader reader = DirectoryReader.open(indexDir);
IndexSearcher indexSearcher = new IndexSearcher(reader);
System.out.println("2a");
MoreLikeThis mlt = new MoreLikeThis(reader);
mlt.setMinTermFreq(0);
mlt.setMinDocFreq(0);
mlt.setFieldNames(new String[]{"title", "content"});
mlt.setAnalyzer(analyzer);
System.out.println("2b");
StringReader sReader = new StringReader(searchForSimilar);
//Query query = mlt.like(sReader, null);
//Throws error - The method like(String, Reader...) in the type MoreLikeThis is not applicable for the arguments (StringReader, null)
Query query = mlt.like("computer");
System.out.println("2c");
System.out.println(query.toString());
TopDocs topDocs = indexSearcher.search(query,10);
for ( ScoreDoc scoreDoc : topDocs.scoreDocs ) {
Document aSimilar = indexSearcher.doc( scoreDoc.doc );
String similarTitle = aSimilar.get("title");
String similarContent = aSimilar.get("content");
System.out.println("====similar finded====");
System.out.println("title: "+ similarTitle);
System.out.println("content: "+ similarContent);
}
System.out.println("2d");
}}
I am unsure what is causing the program to produce no output.
What is your output? I am assuming you're not finding any similar documents. The reason could be that the query you are creating is empty.
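First of all, to run your code in a meaningful way, this line
Query query = mlt.like(sReader, null);
needs its arguments the other way round. Going by the signature in your error message, like(String, Reader...), the field name comes first and the reader after it, so it should work like this:
Query query = mlt.like("content", sReader);
Note that mlt.like("computer") passes "computer" as the field name with no reader at all, so MoreLikeThis has no text to build a query from.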
Now, in order to use MoreLikeThis in Lucene, your stored fields need term vectors enabled, i.e. setStoreTermVectors(true) on the FieldType when creating the fields, for instance like this:
FieldType fieldType = new FieldType();
fieldType.setStored(true);
fieldType.setStoreTermVectors(true);
fieldType.setTokenized(true);
Field contentField = new Field("contents", this.getBlurb(), fieldType);
doc.add(contentField);
Leaving this out could result in an empty query and consequently no results.

How can I get the terms of a Lucene document field tokens after they are analyzed?

I'm using Lucene 5.1.0. After analyzing and indexing a document, I would like to get a list of all the indexed terms that belong to this specific document.
{
File[] files = FILES_TO_INDEX_DIRECTORY.listFiles();
for (File file : files) {
Document document = new Document();
Reader reader = new FileReader(file);
document.add(new TextField("fieldname",reader));
iwriter.addDocument(document);
}
iwriter.close();
IndexReader indexReader = DirectoryReader.open(directory);
int maxDoc=indexReader.maxDoc();
for (int i=0; i < maxDoc; i++) {
Document doc=indexReader.document(i);
String[] terms = doc.getValues("fieldname");
}
}
The terms come back null. Is there a way to get the saved terms per document?
Here is some sample code for the answer, using a TokenStream:
TokenStream ts= analyzer.tokenStream("myfield", reader);
// The Analyzer class will construct the Tokenizer, TokenFilter(s), and CharFilter(s),
// and pass the resulting Reader to the Tokenizer.
OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
CharTermAttribute charTermAttribute = ts.addAttribute(CharTermAttribute.class);
try {
ts.reset(); // Resets this stream to the beginning. (Required)
while (ts.incrementToken()) {
// Use AttributeSource.reflectAsString(boolean)
// for token stream debugging.
System.out.println("token: " + ts.reflectAsString(true));
String term = charTermAttribute.toString();
System.out.println(term);
}
ts.end(); // Perform end-of-stream operations, e.g. set the final offset.
} finally {
ts.close(); // Release resources associated with this stream.
}

Lucene can't find documents after update

It seems that whenever I update an existing document in the index (same behavior for delete / add), it can't be found with a TermQuery. Here's a short snippet:
iw = new IndexWriter(directory, config);
Document doc = new Document();
doc.add(new StringField("string", "a", Store.YES));
doc.add(new IntField("int", 1, Store.YES));
iw.addDocument(doc);
Query query = new TermQuery(new Term("string","a"));
Document[] hits = search(query);
doc = hits[0];
print(doc);
doc.removeField("int");
doc.add(new IntField("int", 2, Store.YES));
iw.updateDocument(new Term("string","a"), doc);
hits = search(query);
System.out.println(hits.length);
System.out.println("_________________");
for(Document hit : search(new MatchAllDocsQuery())){
print(hit);
}
This produces the following console output:
stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<string:a>
stored<int:1>
________________
0
_________________
stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<string:a>
stored<int:2>
________________
It seems that after the update, the document (or rather the new document) is in the index and gets returned by the MatchAllDocsQuery, but can't be found by a TermQuery.
Full sample code available at http://pastebin.com/sP2Vav9v
Also, this only happens (second search not working) when the StringField value contains special characters (e.g. file:/F:/).
The code you referenced on pastebin doesn't find anything because your StringField value is nothing but a stopword (a). Replacing a with something that is not a stopword (e.g. ax) makes both searches return 1 doc.
You would also get the correct result if you constructed StandardAnalyzer with an empty stopword set (CharArraySet.EMPTY_SET) while still using a for the StringField. This wouldn't work for file:/F:/ though.
However, the best solution in this case would be to replace StandardAnalyzer with KeywordAnalyzer.
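To see why the re-added value stops matching, it helps to look at what an analyzer does to it. The console output in the question prints the string field as tokenized, which suggests the value goes through the analyzer when the retrieved document is added again. The sketch below is plain Java and only imitates that behaviour. It is not StandardAnalyzer or KeywordAnalyzer, and the tiny stop list and the split rule are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustration only: crude stand-ins for a tokenizing analyzer and a keyword analyzer. */
final class AnalyzerSketch {

    // Assumed stop list for the demo; the real analyzer ships its own.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of"));

    private AnalyzerSketch() {}

    /** Lower-cases, splits on anything that is not a letter or digit, and drops stop words. */
    static List<String> standardLike(String value) {
        List<String> tokens = new ArrayList<>();
        for (String part : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty() && !STOP_WORDS.contains(part)) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    /** Keeps the whole value as one token, unchanged. */
    static List<String> keywordLike(String value) {
        return Collections.singletonList(value);
    }

    /** An exact term lookup succeeds only if the term equals one of the indexed tokens. */
    static boolean exactTermMatches(List<String> indexedTokens, String term) {
        return indexedTokens.contains(term);
    }
}
```

With these stand-ins, "a" produces no tokens at all, "ax" survives as a single token, and "file:/F:/" is broken into pieces that no longer equal the original string, which is why only the keyword-style treatment keeps an exact lookup working for such values.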
I got rid of this problem by recreating my working directory after all indexing operations:
Create a new directory just for these indexing operations, named "path_dir" for example. Whenever you have made updates, run the following and then redo all of your previous indexing work.
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
FSDirectory dir;
try {
// delete indexing files :
dir = FSDirectory.open(new File(path_dir));
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, analyzer);
IndexWriter writer = new IndexWriter(dir, config);
writer.deleteAll();
writer.close();
} catch (IOException e) {
e.printStackTrace();
}
Note, however, that this approach will be very slow if you are handling large amounts of data.

how to search a file with lucene

I want to search for a query within a file "fdictionary.txt" containing a list of words (230,000 words), one per line. Any suggestion as to why this code is not working?
The spell checking part works and gives me the list of suggestions (I limited the length of the list to 1). What I want to do is search fdictionary and, if the word is already in there, not call spell checking. My search function is not working, and it does not give me an error. Here is what I have implemented:
public class SpellCorrection {
public static File indexDir = new File("/../idxDir");
public static void main(String[] args) throws IOException, FileNotFoundException, CorruptIndexException, ParseException {
Directory directory = FSDirectory.open(indexDir);
SpellChecker spell = new SpellChecker(directory);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_20, null);
File dictionary = new File("/../fdictionary00.txt");
spell.indexDictionary(new PlainTextDictionary(dictionary), config, true);
String query = "red"; //kne, console
String correctedQuery = query; //kne, console
if (!search(directory, query)) {
String[] suggestions = spell.suggestSimilar(query, 1);
if (suggestions != null) {correctedQuery=suggestions[0];}
}
System.out.println("The Query was: "+query);
System.out.println("The Corrected Query is: "+correctedQuery);
}
public static boolean search(Directory directory, String queryTerm) throws FileNotFoundException, CorruptIndexException, IOException, ParseException {
boolean isIn = false;
IndexReader indexReader = IndexReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_20);
Term term = new Term(queryTerm);
Query termQuery = new TermQuery(term);
TopDocs hits = indexSearcher.search(termQuery, 100);
System.out.println(hits.totalHits);
if (hits.totalHits > 0) {
isIn = true;
}
return isIn;
}
}
Where are you indexing the content from fdictionary00.txt?
You can search with IndexSearcher only when you have an index. If you are new to Lucene, you might want to check some quick tutorials (like http://lucenetutorial.com/lucene-in-5-minutes.html).
You never built the index.
You need to set up the index:
Directory directory = FSDirectory.open(indexDir);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_20);
IndexWriter writer = new IndexWriter(directory,analyzer,true,IndexWriter.MaxFieldLength.UNLIMITED );
You then need to create a document and add each term to the document as an analyzed field:
Document doc = new Document();
doc.add(new Field("name", word, Field.Store.YES, Field.Index.ANALYZED));
Then add the document to the index:
writer.addDocument(doc);
writer.optimize();
Now commit the index and close the index writer:
writer.commit();
writer.close();
You could make your SpellChecker instance available in a service and use spellChecker.exist(word).
Be aware that the SpellChecker will not index words 2 characters or less. To get around this you can add them to the index after you have created it (add them into SpellChecker.F_WORD field).
If you want to add to your live index and make them available for exist(word) then you will need to add them to the SpellChecker.F_WORD field. Of course, because you're not adding to all the other fields such as gram/start/end etc then your word will not appear as a suggestion for other misspelled words.
In this case you'd have had to add the word into your file so when you re-create the index it would then be available as a suggestion. It would be great if the project made SpellChecker.createDocument(...) public/protected, rather than private, as this method accomplishes everything with adding words.
After all this you need to call spellChecker.setSpellIndex(directory).
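If the only goal is to skip spell checking for words that are already in the dictionary file, one alternative to searching a Lucene index is to load the word list into an in-memory set and test membership there. This is a different technique from the index-based answers above, sketched in plain Java. The class and method names are hypothetical, and lower-casing the words is an assumption about how the lookup should behave.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical in-memory word list built from a one-word-per-line source. */
final class WordList {

    private final Set<String> words = new HashSet<>();

    /** Reads one word per line, ignoring blank lines and surrounding whitespace. */
    static WordList from(Reader source) {
        WordList list = new WordList();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String word = line.trim().toLowerCase(Locale.ROOT);
                if (!word.isEmpty()) {
                    list.words.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return list;
    }

    /** Case-insensitive membership test. */
    boolean contains(String word) {
        return words.contains(word.trim().toLowerCase(Locale.ROOT));
    }

    int size() {
        return words.size();
    }
}
```

In main, the list would be built once from a FileReader over the dictionary file, and the condition would become if (!wordList.contains(query)) before calling spell.suggestSimilar(query, 1).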

Open Microsoft Word in Java

I'm trying to open an MS Word 2003 document in Java, search for a specified String and replace it with a new String. I use Apache POI to do that. My code looks like this:
public void searchAndReplace(String inputFilename, String outputFilename,
HashMap<String, String> replacements) {
File outputFile = null;
File inputFile = null;
FileInputStream fileIStream = null;
FileOutputStream fileOStream = null;
BufferedInputStream bufIStream = null;
BufferedOutputStream bufOStream = null;
POIFSFileSystem fileSystem = null;
HWPFDocument document = null;
Range docRange = null;
Paragraph paragraph = null;
CharacterRun charRun = null;
Set<String> keySet = null;
Iterator<String> keySetIterator = null;
int numParagraphs = 0;
int numCharRuns = 0;
String text = null;
String key = null;
String value = null;
try {
// Create an instance of the POIFSFileSystem class and
// attach it to the Word document using an InputStream.
inputFile = new File(inputFilename);
fileIStream = new FileInputStream(inputFile);
bufIStream = new BufferedInputStream(fileIStream);
fileSystem = new POIFSFileSystem(bufIStream);
document = new HWPFDocument(fileSystem);
docRange = document.getRange();
numParagraphs = docRange.numParagraphs();
keySet = replacements.keySet();
for (int i = 0; i < numParagraphs; i++) {
paragraph = docRange.getParagraph(i);
text = paragraph.text();
numCharRuns = paragraph.numCharacterRuns();
for (int j = 0; j < numCharRuns; j++) {
charRun = paragraph.getCharacterRun(j);
text = charRun.text();
System.out.println("Character Run text: " + text);
keySetIterator = keySet.iterator();
while (keySetIterator.hasNext()) {
key = keySetIterator.next();
if (text.contains(key)) {
value = replacements.get(key);
charRun.replaceText(key, value);
docRange = document.getRange();
paragraph = docRange.getParagraph(i);
charRun = paragraph.getCharacterRun(j);
text = charRun.text();
}
}
}
}
bufIStream.close();
bufIStream = null;
outputFile = new File(outputFilename);
fileOStream = new FileOutputStream(outputFile);
bufOStream = new BufferedOutputStream(fileOStream);
document.write(bufOStream);
} catch (Exception ex) {
System.out.println("Caught an: " + ex.getClass().getName());
System.out.println("Message: " + ex.getMessage());
System.out.println("Stacktrace follows.............");
ex.printStackTrace(System.out);
}
}
I call this function with following arguments:
HashMap<String, String> replacements = new HashMap<String, String>();
replacements.put("AAA", "BBB");
searchAndReplace("C:/Test.doc", "C:/Test1.doc", replacements);
When the Test.doc file contains a simple line like "AAA EEE", it works. With a more complicated file, the content is read successfully and Test1.doc is generated, but when I try to open it I get the following error:
Word unable to read this document. It may be corrupt.
Try one or more of the following:
* Open and repair the file.
* Open the file with Text Recovery converter.
(C:\Test1.doc)
Please tell me what to do, because I'm a beginner in POI and I have not found a good tutorial for it.
First of all you should be closing your document.
Besides that, what I suggest doing is resaving your original Word document as a Word XML document, then changing the extension manually from .XML to .doc . Then look at the XML of the actual document you're working with and trace the content to make sure you're not accidentally editing hexadecimal values (AAA and EEE could be hex values in other fields).
Without seeing the actual Word document it's hard to say what's going on.
Unfortunately there is not much documentation about POI at all, especially for Word documents.
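On the point about closing: in the question's code the BufferedOutputStream is never flushed or closed, so buffered bytes may never reach the file. Whether that is the whole cause of the corruption here is not certain, but it is cheap to rule out. The sketch below uses only the JDK, with an in-memory target instead of a real file; the class and method names are made up for the demonstration.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

/** Illustration only: what closing a buffered stream changes. */
final class CloseDemo {

    private CloseDemo() {}

    /** Writes through a buffer and closes it, so every byte reaches the target. */
    static void writeAndClose(OutputStream target, byte[] data) {
        try (BufferedOutputStream out = new BufferedOutputStream(target, 8192)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes through a buffer without flushing or closing; returns how many bytes reached the target. */
    static int bytesVisibleWithoutClose(byte[] data) {
        ByteArrayOutputStream target = new ByteArrayOutputStream();
        BufferedOutputStream out = new BufferedOutputStream(target, 8192);
        try {
            out.write(data); // stays in the buffer while data is smaller than the buffer
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target.size();
    }
}
```

Applied to the question's method, this would mean wrapping the FileOutputStream in a try-with-resources block around document.write(bufOStream), so the stream is closed even if an exception is thrown.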
I don't know whether it's OK to answer my own question, but just to share the knowledge, I'll do so.
After searching the web, the final solution I found is this:
The library docx4j is very good for dealing with MS docx files. Its documentation is still thin and its forum is only just getting started, but overall it helped me do what I needed.
Thanks to all who helped me.
You could try the OpenOffice API, but there aren't many resources out there that tell you how to use it.
You can also try this one: http://www.dancrintea.ro/doc-to-pdf/
Looks like this could be the issue.
