Lucene Java opening too many files. Am I using IndexWriter properly?

My Lucene Java implementation is opening too many files. I followed the instructions in the Lucene Wiki about too many open files, but that only slowed the problem down. Here is my code to add objects (PTicket) to the index:
//This gets called when the bean is instantiated
public void initializeIndex() {
    analyzer = new WhitespaceAnalyzer(Version.LUCENE_32);
    config = new IndexWriterConfig(Version.LUCENE_32, analyzer);
}

public void addAllToIndex(Collection<PTicket> records) {
    IndexWriter indexWriter = null;
    config = new IndexWriterConfig(Version.LUCENE_32, analyzer);
    try {
        indexWriter = new IndexWriter(directory, config);
        for (PTicket record : records) {
            Document doc = new Document();
            StringBuffer documentText = new StringBuffer();
            doc.add(new Field("_id", record.getIdAsString(), Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("_type", record.getType(), Field.Store.YES, Field.Index.ANALYZED));
            for (String key : record.getProps().keySet()) {
                List<String> vals = record.getProps().get(key);
                for (String val : vals) {
                    addToDocument(doc, key, val);
                    documentText.append(val).append(" ");
                }
            }
            addToDocument(doc, DOC_TEXT, documentText.toString());
            indexWriter.addDocument(doc);
        }
        indexWriter.optimize();
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        cleanup(indexWriter);
    }
}

private void cleanup(IndexWriter iw) {
    if (iw == null) {
        return;
    }
    try {
        iw.close();
    } catch (IOException ioe) {
        logger.error("Error trying to close index writer");
        logger.error("{}", ioe.getClass().getName());
        logger.error("{}", ioe.getMessage());
    }
}

private void addToDocument(Document doc, String field, String value) {
    doc.add(new Field(field, value, Field.Store.YES, Field.Index.ANALYZED));
}
EDIT TO ADD code for searching
public Set<Object> searchIndex(AthenaSearch search) {
    try {
        Query q = new QueryParser(Version.LUCENE_32, DOC_TEXT, analyzer).parse(query);
        //searcher is actually instantiated in initialization. Lucene recommends this.
        //IndexSearcher searcher = new IndexSearcher(directory, true);
        TopDocs topDocs = searcher.search(q, numResults);
        ScoreDoc[] hits = topDocs.scoreDocs;
        for (int i = start; i < hits.length; ++i) {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            ids.add(d.get("_id"));
        }
        return ids;
    } catch (Exception e) {
        e.printStackTrace();
        return null;
    }
}
This code is in a web application.
1) Is this the advised way to use IndexWriter (instantiating a new one on each add to index)?
2) I've read that raising ulimit will help, but that just seems like a band-aid that won't address the actual problem.
3) Could the problem lie with IndexSearcher?

1) Is this the advised way to use IndexWriter (instantiating a new one on each add to index)?
I advise no. There are constructors which will check whether an index already exists in the directory and either open it or create a new one there. Problem 2 would be solved if you reuse the IndexWriter.
EDIT:
OK, it seems that in Lucene 3.2 all but one of the constructors are deprecated, so reuse of the IndexWriter can be achieved by using the enum IndexWriterConfig.OpenMode with the value CREATE_OR_APPEND.
Also, opening a new writer and closing it on each document add is not efficient; I suggest reusing it. If you want to speed up indexing, raise the RAM buffer with setRAMBufferSizeMB (the default value is 16 MB) and tune it by trial and error.
from the docs:
Note that you can open an index with create=true even while readers are using the index. The old readers will continue to search the "point in time" snapshot they had opened, and won't see the newly created index until they re-open.
Also reuse the IndexSearcher. I cannot see the code for searching, but IndexSearcher is thread-safe and can be used read-only as well.
I also suggest setting the merge factor on the writer. This is not necessary, but it will help limit the creation of inverted index files; again, tune it by trial and error.
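A minimal sketch of the reuse I mean, against the Lucene 3.2 API. The class name and the buildDocument helper are placeholders for your existing bean and document-building loop; PTicket, directory and the analyzer are the ones from your code:

import java.io.IOException;
import java.util.Collection;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;

public class TicketIndexer {
    private final Analyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_32);
    private IndexWriter indexWriter;

    // Called once, when the bean is instantiated
    public void initializeIndex(Directory directory) throws IOException {
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_32, analyzer);
        config.setOpenMode(OpenMode.CREATE_OR_APPEND); // reuse an existing index or create one
        config.setRAMBufferSizeMB(32.0);               // default is 16 MB; tune by trial and error
        indexWriter = new IndexWriter(directory, config);
    }

    // Called for every batch; the writer is NOT closed here
    public void addAllToIndex(Collection<PTicket> records) throws IOException {
        for (PTicket record : records) {
            indexWriter.addDocument(buildDocument(record)); // your existing Document-building code
        }
        indexWriter.commit(); // makes the batch visible to new readers without closing the writer
    }

    // Called once, when the application shuts down
    public void shutdownIndex() throws IOException {
        indexWriter.close();
    }
}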

I think we'd need to see your search code to be sure, but I'd suspect that it is a problem with the index searcher. More specifically, make sure that your index reader is being properly closed when you've finished with it.
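For example, a hedged sketch against the Lucene 3.2 API: open the reader once, and reopen it only when the index has actually changed, closing the old one so its files are released (directory is the one from the question):

// Open once, at startup
IndexReader reader = IndexReader.open(directory, true); // read-only reader
IndexSearcher searcher = new IndexSearcher(reader);

// Before (or periodically between) searches, refresh only if the index changed
IndexReader newReader = reader.reopen();
if (newReader != reader) {
    reader.close();          // release the files held by the old reader
    reader = newReader;
    searcher = new IndexSearcher(reader);
}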
Good luck,

The scientifically correct answer would be: you can't really tell from this fragment of code.
The more constructive answer would be:
You have to make sure that only one IndexWriter is writing to the index at any given time, and you therefore need some mechanism to ensure that. So my answer depends on what you want to accomplish:
Do you want a deeper understanding of Lucene? Or...
Do you just want to build and use an index?
If your answer is the latter, you probably want to look at projects like Solr, which hide all the index reading and writing.

This question is probably a duplicate of "Too many open files Error on Lucene".
I am repeating my answer for that here.
Use the compound file format to reduce the file count. When this flag is set, Lucene will write each segment as a single .cfs file instead of multiple files. This will reduce the number of files significantly.
IndexWriter.setUseCompoundFile(true)
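In the 3.x line the writer-level setter is deprecated in favor of the merge policy, so the same effect can be had like this (a sketch that assumes the directory and IndexWriterConfig from the question):

// Write new segments in the compound (.cfs) format to keep the file count down
LogMergePolicy mergePolicy = new LogByteSizeMergePolicy();
mergePolicy.setUseCompoundFile(true);
config.setMergePolicy(mergePolicy);
IndexWriter indexWriter = new IndexWriter(directory, config);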


Liferay Concurrent FileEntry Upload

Problem Statement:
In Liferay I have to import a zip file into some folder in the Liferay CMS. So far I had implemented serial unzipping of the zip file, creating its folders and then its files. The problem is that the whole process takes a lot of time, so I had to use a parallel approach to creating the folders and files.
My Solution:
I have used a java.util.concurrent.ExecutorService to create Executors.newFixedThreadPool(NTHREDS), where NTHREDS is the number of threads to run in parallel (say 5)
I read all the folder paths from the zip and placed the list of zip entries (files) against the folder path as a key in a HashMap
I traversed all keys in the map and created the folders serially
I then traversed the lists of zip entries (files) from the map and passed them to worker threads, one file per worker; these workers are then sent to the ExecutorService to execute
So far I haven't found any significant reduction in the time of the whole process. Am I moving in the correct direction? Does Liferay support concurrent file addition? What am I doing wrong?
I will be very thankful for any help in this regard.
Below is my code:
imports
...
...
public class TestImportZip {

    private static final int NTHREDS = 5;
    ExecutorService executor = null;
    ...
    ...
    ....
    Map<String, Folder> folders = new HashMap<String, Folder>();
    File zipsFile = null;

    public TestImportZip(............, File zipFile, .) {
        .
        .
        this.zipsFile = zipFile;
        this.executor = Executors.newFixedThreadPool(NTHREDS);
    }

    // From here the process starts
    public void importZip() {
        Map<String, List<ZipEntry>> foldersMap = new HashMap<String, List<ZipEntry>>();
        try (ZipFile zipFile = new ZipFile(zipsFile)) {
            zipFile.stream().forEach(entry -> {
                String entryName = entry.getName();
                if (entryName.contains("/")) {
                    String key = entryName.substring(0, entryName.lastIndexOf("/"));
                    List<ZipEntry> zipEntries = foldersMap.get(key);
                    if (zipEntries == null) {
                        zipEntries = new ArrayList<>();
                    }
                    zipEntries.add(entry);
                    foldersMap.put(key, zipEntries);
                }
            });
            createFolders(foldersMap.keySet());
            createFiles(foldersMap);
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

    private void createFolders(Set<String> folderPathSets) {
        // create folder and put the folder in map
        .
        .
        .
        folders.put(folderPath, folder);
    }

    private void createFiles(Map<String, List<ZipEntry>> foldersMap) {
        .
        .
        .
        //Traverse all the files from all the list in map and send them to worker
        createFileWorker(folderPath, zipEntry);
    }

    private void createFileWorker(String folderPath, ZipEntry zipEntry) {
        CreateEntriesWorker cfw = new CreateEntriesWorker(folderPath, zipEntry);
        executor.execute(cfw);
    }

    class CreateEntriesWorker implements Runnable {
        Folder folder = null;
        ZipEntry entryToCreate = null;

        public CreateEntriesWorker(String folderPath, ZipEntry zipEntry) {
            this.entryToCreate = zipEntry;
            // get folder from already created folder map
            this.folder = folders.get(folderPath);
        }

        public void run() {
            if (this.folder != null) {
                long startTime = System.currentTimeMillis();
                try (ZipFile zipFile = new ZipFile(zipsFile)) {
                    InputStream inputStream = zipFile.getInputStream(entryToCreate);
                    try {
                        String name = entryToCreate.getName();
                        // created file entry here
                    } catch (Exception e) {
                    } finally {
                        if (inputStream != null)
                            inputStream.close();
                    }
                } catch (IOException e) {
                    // TODO Auto-generated catch block
                    e.printStackTrace();
                }
            }
        }
    }
}
Your simplified code does not contain any Liferay reference that I recognize. The description you provide hints that you're trying to optimize some code but aren't getting any better performance out of the attempt. This is typically a sign that you're trying to optimize the wrong aspect of the problem (or that it's already quite optimized).
You'll need to determine the actual bottleneck of your operation in order to know whether it's feasible to optimize. There's a common saying that "premature optimization is the root of all evil". What does that mean?
I'll completely make up numbers here - don't quote me on them: they're freely invented for illustration purposes. Let's say that your operation of adding the contents of a zip file to Liferay's repository is distributed across the following percentages of operational resources:
4% zip file decoding/decompressing
6% file I/O for zip operations and temporary files
10% database operation for storing the files
60% for extracting text-only from word, pdf, excel and other files stored within the zip file in order to index the document in the full-text index
20% overhead of the full-text indexing library for putting together the index.
Suppose you're optimizing the zip file decoding/decompressing: what overall improvement can you expect?
While my numbers are made up: if your optimizations do not have any result, I'd recommend reverting them, measuring where you actually need to optimize, and going after that place (or accepting it and upgrading your hardware if that place is out of reach).
Run those numbers for CPU, I/O, memory and other potential bottlenecks. Identify your actual bottleneck #1, fix it, and measure again. You'll see that bottleneck #2 has gotten a promotion. Rinse and repeat until you're happy.

Lucene can't find documents after update

It seems that whenever I update an existing document in the index (same behavior for delete / add), it can't be found with a TermQuery. Here's a short snippet:
iw = new IndexWriter(directory, config);
Document doc = new Document();
doc.add(new StringField("string", "a", Store.YES));
doc.add(new IntField("int", 1, Store.YES));
iw.addDocument(doc);
Query query = new TermQuery(new Term("string","a"));
Document[] hits = search(query);
doc = hits[0];
print(doc);
doc.removeField("int");
doc.add(new IntField("int", 2, Store.YES));
iw.updateDocument(new Term("string","a"), doc);
hits = search(query);
System.out.println(hits.length);
System.out.println("_________________");
for(Document hit : search(new MatchAllDocsQuery())){
    print(hit);
}
This produces the following console output:
stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<string:a>
stored<int:1>
________________
0
_________________
stored,indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY<string:a>
stored<int:2>
________________
It seems that after the update, the document (rather, the new document) is in the index and gets returned by the MatchAllDocsQuery, but it can't be found by a TermQuery.
Full sample code available at http://pastebin.com/sP2Vav9v
Also, this only happens (second search not working) when the StringField value contains special characters (e.g. file:/F:/).
The code which you have referenced on pastebin doesn't find anything because your StringField value is nothing but a stopword (a). Replacing a with something that is not a stopword (e.g. ax) makes both searches return 1 doc.
You would also achieve the correct result if you constructed StandardAnalyzer with an empty stopword set (CharArraySet.EMPTY_SET) while still using a for the StringField. This wouldn't work for file:/F:/ though.
However, the best solution in this case would be to replace StandardAnalyzer with KeywordAnalyzer.
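A minimal sketch of that last suggestion, assuming the same Lucene 4.x classes and the directory from the question (the version constant and field values here are just illustrative):

// KeywordAnalyzer emits the whole field value as a single token, so nothing
// is dropped as a stopword and values like "file:/F:/" survive intact.
Analyzer analyzer = new KeywordAnalyzer();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, analyzer);
IndexWriter iw = new IndexWriter(directory, config);

Document doc = new Document();
doc.add(new StringField("string", "file:/F:/", Store.YES));
doc.add(new IntField("int", 1, Store.YES));
iw.addDocument(doc);
iw.commit();

// The exact, untokenized value can still be found with a TermQuery, even
// after updateDocument() re-adds a document loaded back from the searcher.
Query query = new TermQuery(new Term("string", "file:/F:/"));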
I could get rid of this by recreating my working directory after all indexing operations:
Create a new directory just for these indexing operations, named "path_dir" for example. If you have updated, then call the following operations and do all of your previous work again.
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
FSDirectory dir;
try {
    // delete indexing files :
    dir = FSDirectory.open(new File(path_dir));
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, analyzer);
    IndexWriter writer = new IndexWriter(dir, config);
    writer.deleteAll();
    writer.close();
} catch (IOException e) {
    e.printStackTrace();
}
However, note that this way will be very slow if you are handling big data.

Set indexed attribute of a lucene field?

I have a big lucene index produced by 3rd party.
I want to search over a field which is not indexed. Is it possible to re-create the index with that field now being indexed?
I am assuming that the field is stored, right? If not, you are out of luck.
If it is stored, you have several options; I think the easiest would be:
dump all docs as csv output (see here)
change that field's schema to indexed=true
then reindex all of them (csv output can be used for update as well)
Solved it myself, just using an index reader and a writer.
I don't know if it's the proper way. The field was a string field (stored), so in this case it just worked.
IndexReader reader = IndexReader.open(...);
IndexWriter writer = new IndexWriter(...);
for (int i = 0; i < reader.maxDoc(); i++) {
    if (reader.isDeleted(i)) continue;
    Document d = reader.document(i);
    Document d2 = new Document();
    for (Field f : (List<Field>) d.getFields()) {
        Field f2 = f;
        if (f.name().equals(FIELD_NAME))
            f2 = new Field(FIELD_NAME, f.stringValue(), Field.Store.YES, Field.Index.NOT_ANALYZED);
        d2.add(f2);
    }
    writer.addDocument(d2);
}
writer.optimize();
writer.close();

how to search a file with lucene

I want to do a search for a query within a file "fdictionary.txt" containing a list of words (230,000 words) written line by line. Any suggestions as to why this code is not working?
The spell checking part is working and gives me the list of suggestions (I limited the length of the list to 1). What I want to do is search that dictionary, and if the word is already in there, not call spell checking. My search function is not working. It does not give me an error! Here is what I have implemented:
public class SpellCorrection {

    public static File indexDir = new File("/../idxDir");

    public static void main(String[] args) throws IOException, FileNotFoundException, CorruptIndexException, ParseException {
        Directory directory = FSDirectory.open(indexDir);
        SpellChecker spell = new SpellChecker(directory);
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_20, null);
        File dictionary = new File("/../fdictionary00.txt");
        spell.indexDictionary(new PlainTextDictionary(dictionary), config, true);
        String query = "red"; //kne, console
        String correctedQuery = query; //kne, console
        if (!search(directory, query)) {
            String[] suggestions = spell.suggestSimilar(query, 1);
            if (suggestions != null) { correctedQuery = suggestions[0]; }
        }
        System.out.println("The Query was: " + query);
        System.out.println("The Corrected Query is: " + correctedQuery);
    }

    public static boolean search(Directory directory, String queryTerm) throws FileNotFoundException, CorruptIndexException, IOException, ParseException {
        boolean isIn = false;
        IndexReader indexReader = IndexReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_20);
        Term term = new Term(queryTerm);
        Query termQuery = new TermQuery(term);
        TopDocs hits = indexSearcher.search(termQuery, 100);
        System.out.println(hits.totalHits);
        if (hits.totalHits > 0) {
            isIn = true;
        }
        return isIn;
    }
}
Where are you indexing the content from fdictionary00.txt?
You can search using IndexSearcher only when you have an index. If you are new to Lucene, you might want to check some quick tutorials (like http://lucenetutorial.com/lucene-in-5-minutes.html).
You never built the index.
You need to set up the index:
Directory directory = FSDirectory.open(indexDir);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_20);
IndexWriter writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
You then need to create a document and add each term to the document as an analyzed field:
Document doc = new Document();
doc.add(new Field("name", word, Field.Store.YES, Field.Index.ANALYZED));
Then add the document to the index:
writer.addDocument(doc);
writer.optimize();
Now commit the changes and close the index writer:
writer.commit();
writer.close();
You could make your SpellChecker instance available in a service and use spellChecker.exist(word).
Be aware that the SpellChecker will not index words of 2 characters or fewer. To get around this you can add them to the index after you have created it (add them to the SpellChecker.F_WORD field).
If you want to add words to your live index and make them available to exist(word), then you will need to add them to the SpellChecker.F_WORD field. Of course, because you're not populating all the other fields such as gram/start/end etc., your word will not appear as a suggestion for other misspelled words.
In that case you'd have to add the word to your file as well, so that when you re-create the index it becomes available as a suggestion. It would be great if the project made SpellChecker.createDocument(...) public/protected rather than private, as that method accomplishes everything involved in adding words.
After all this, you need to call spellChecker.setSpellIndex(directory).
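Putting that together with the code in the question, a rough sketch (this reuses the spell and query variables the question already has; the empty-array check is an extra guard, since suggestSimilar can return an empty array):

// Skip the suggestion step entirely when the word is already in the dictionary index
if (spell.exist(query)) {
    correctedQuery = query;
} else {
    String[] suggestions = spell.suggestSimilar(query, 1);
    if (suggestions != null && suggestions.length > 0) {
        correctedQuery = suggestions[0];
    }
}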

What is the most memory efficient way to write from a database to a (zip) file in Java?

My program is fast enough, but I'd rather give up that speed for memory optimization, since one user's maximum memory usage goes up to 300 MB, meaning a few of them could constantly crash the application. Most of the answers I found were related to speed optimization, and the others were just general ("if you write directly from a database to memory there shouldn't be much memory usage"). Well, it seems there is :) I was thinking about not posting code so I wouldn't "lock" someone's ideas, but on the other hand, I could be wasting your time if you don't see what I've already done, so here it is:
// First I get the data from the database in a way that I think can't be more
// optimized, since I've done some testing and it seems to me that the problem
// isn't in the RS, and setting FetchSize and/or direction does not help.
public static void generateAndWriteXML(String query, String oznaka, BufferedOutputStream bos, Connection conn)
        throws Exception
{
    ResultSet rs = null;
    Statement stmt = null;
    try
    {
        stmt = conn.createStatement(ResultSet.TYPE_SCROLL_INSENSITIVE, ResultSet.CONCUR_READ_ONLY);
        rs = stmt.executeQuery(query);
        writeToZip(rs, oznaka, bos);
    } finally
    {
        ConnectionManager.close(rs, stmt, conn);
    }
}
// then I open up my streams. In the next method I'll generate an XML from the
// ResultSet and I want that XML to be saved in an XML file, but since its size takes up
// to 300MB, I want it to be saved in a ZIP. I'm thinking that maybe by writing
// first to file, then to zip I could get a slower but more efficient program.
private static void writeToZip(ResultSet rs, String oznaka, BufferedOutputStream bos)
        throws SAXException, SQLException, IOException
{
    ZipEntry ze = new ZipEntry(oznaka + ".xml");
    ZipOutputStream zos = new ZipOutputStream(bos);
    zos.putNextEntry(ze);
    OutputStreamWriter writer = new OutputStreamWriter(zos, "UTF8");
    writeXMLToWriter(rs, writer);
    try
    {
        writer.close();
    } catch (IOException e)
    {
    }
    try
    {
        zos.closeEntry();
    } catch (IOException e)
    {
    }
    try
    {
        zos.flush();
    } catch (IOException e)
    {
    }
    try
    {
        bos.close();
    } catch (IOException e)
    {
    }
}
// And finally, the method that does the actual generating and writing.
// This is the second point where I think I could do the memory optimization, since the
// DataWriter is custom and it extends a custom XMLWriter that extends the standard
// org.xml.sax.helpers.XMLFilterImpl. I've tried flushing at points in the program,
// but the memory that is occupied remains the same, it only takes longer.
public static void writeXMLToWriter(ResultSet rs, Writer writer) throws SAXException, SQLException, IOException
{
    // Set up XML
    DataWriter w = new DataWriter(writer);
    w.startDocument();
    w.setIndentStep(2);
    w.startElement(startingXMLElement);
    // Get the metadata
    ResultSetMetaData meta = rs.getMetaData();
    int count = meta.getColumnCount();
    // Iterate over the set
    while (rs.next())
    {
        w.startElement(rowElement);
        for (int i = 0; i < count; i++)
        {
            Object ob = rs.getObject(i + 1);
            if (rs.wasNull())
            {
                ob = null;
            }
            // XML elements are repeated so they could benefit from caching
            String colName = meta.getColumnLabel(i + 1).intern();
            if (ob != null)
            {
                if (ob instanceof Timestamp)
                {
                    w.dataElement(colName, Util.formatDate((Timestamp) ob, dateFormat));
                }
                else if (ob instanceof BigDecimal)
                {
                    // Possible benefit from writing ints as strings and interning them
                    w.dataElement(colName, Util.transformToHTML(new Integer(((BigDecimal) ob).intValue())));
                }
                else
                { // there's enough repeated data to justify the use of interning
                    w.dataElement(colName, ob.toString().intern());
                }
            }
            else
            {
                w.emptyElement(colName);
            }
        }
        w.endElement(rowElement);
    }
    w.endElement(startingXMLElement);
    w.endDocument();
}
EDIT: Here is an example of memory usage (taken with VisualVM):
EDIT2: The database is Oracle 10.2.0.4, and I've set ResultSet.TYPE_FORWARD_ONLY and got a maximum of 50 MB usage! As I said in the comments, I'll keep an eye on this, but it's really promising.
EDIT3: It seems there's another possible optimization available. As I said, I'm generating XML, so lots of data is repeated (if nothing else, then the tags), which means String.intern() could help me here. I'll post back when I test this.
Is it possible to use ResultSet.TYPE_FORWARD_ONLY?
You have used ResultSet.TYPE_SCROLL_INSENSITIVE. I believe that for some databases (you didn't say which one you use) this causes the whole result set to be loaded into memory.
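A sketch of the change in generateAndWriteXML (the fetch size here is just an illustration; Oracle's JDBC driver defaults to 10 rows per round trip):

// Stream the rows forward-only instead of materializing a scrollable result set
stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(500); // rows fetched per round trip; tune for your driver and row size
rs = stmt.executeQuery(query);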
Since it's Java, the memory should only spike temporarily, unless you are leaking references, for example by pushing things onto a list that is a member of a singleton with the lifespan of the entire program. In my experience the more likely culprit is resource leaking, which happens when objects that use unmanaged resources like file handles never run their cleanup code (I'm assuming this applies to Java, although I'm thinking of C#). That condition is commonly caused by empty exception handlers that do not re-throw to the parent stack frame, which has the net effect of circumventing the finally block.
I've run some more tests and the conclusions are:
The biggest gain is in the JVM (or VisualVM has problems monitoring Java 5 heap space :). When I first reported that ResultSet.TYPE_FORWARD_ONLY got me a significant gain, I was wrong. The biggest gain came from using Java 5, under which the same program used up to 50 MB of heap space, as opposed to Java 6, under which the same code took up to 150 MB.
The second gain is from ResultSet.TYPE_FORWARD_ONLY, which made the program take as little memory as possible.
The third gain is from String.intern(), which made the program take a bit less memory since it caches strings instead of creating new ones.
This is the usage with optimizations 2 and 3 (without String.intern() the graph would be the same; you would just add 5 MB more at every point)
and this is the usage without them (the lower usage at the end is due to the program running out of memory :) )
Thank you everyone for your assistance.
