Do you need a new instance of IndexReader and IndexSearcher every time? - java

In a web application, I have a single IndexReader and a corresponding IndexSearcher for the whole application.
The documentation says they are thread-safe, so that's fine, but are they supposed to keep working properly if the underlying index changes (e.g. an IndexWriter makes changes)?

Yes, you need to reopen the reader.
Use a SearcherManager to keep the reader up to date, calling maybeRefresh after you've made changes.
A sample in Scala:
@volatile protected var LAST_SEARCHER_REFRESH = 0
protected lazy val SEARCHER_MANAGER = new SearcherManager(LUCENE_DIR, new SearcherFactory {
  override def newSearcher(reader: IndexReader): IndexSearcher = new IndexSearcher(reader)
})
...
// Refresh at most once per second, then acquire/release around every search.
if (LAST_SEARCHER_REFRESH != Time.relativeSeconds()) { LAST_SEARCHER_REFRESH = Time.relativeSeconds(); SEARCHER_MANAGER.maybeRefresh() }
val searcher = SEARCHER_MANAGER.acquire()
try {
  searcher.search(query, collector)
  ...
} finally {
  SEARCHER_MANAGER.release(searcher)
}
Sometimes you'll have to implement your own caching though, for example when you need to keep the IndexReader in sync with a TaxonomyReader. The IndexSearcher documentation has recommendations on how to do that (there is a fast path via DirectoryReader.open(IndexWriter, applyAllDeletes) for making a new reader from the writer that was used to commit the changes).
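For reference, a rough Java sketch of the same acquire/refresh/release pattern built on that near-real-time path, assuming a Lucene 4.x-style API where SearcherManager can be constructed directly on the IndexWriter; writer, query and collector are placeholders and imports are omitted:
// One SearcherManager for the whole application, built on the IndexWriter
// so that refreshed searchers see changes without waiting for a commit.
SearcherManager manager = new SearcherManager(writer, true, null); // applyAllDeletes = true, null = default SearcherFactory

// After indexing changes (or periodically), ask for a refresh; this is cheap if nothing changed.
manager.maybeRefresh();

// Per search request: acquire, use, release.
IndexSearcher searcher = manager.acquire();
try {
    searcher.search(query, collector);
} finally {
    manager.release(searcher);
}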

Related

Executor service in executor service?

In my document import method I work with a large number of files; each file can be 100 MB-200 MB. I want to process them asynchronously with threads. In a for loop, each file is processed and then indexed (with Lucene). This operation is very costly and slow, and the overall import must not stop.
The general structure of the import method is given below:
public void docImport()
{
    ExecutorService executor = Executors.newFixedThreadPool(5);
    for (final File file : fileList)
    {
        //Do some works...
        executor.execute(new Runnable() {
            @Override
            public void run() {
                zipFile(file);   //Each zipped file has diff name and same directory.
                indexFile(file); //Each file is indexed same directory.
            }
        });
    }
    executor.shutdown();
}
The general structure of the indexFile method:
public void indexFile()
{
    ExecutorService executor = Executors.newFixedThreadPool(1);
    IndexWriter writer = null;
    Directory dir = .....;
    Analyzer analyzer = new StandardAnalyzer(LUCENE_VERSION);
    IndexWriterConfig iwc = new IndexWriterConfig(LUCENE_VERSION, analyzer);
    iwc.setRAMBufferSizeMB(200);
    iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
    writer = new IndexWriter(dir, iwc);
    Document lucenedoc = new Document();
    lucenedoc.add(..);
    if (writer.getConfig().getOpenMode() == IndexWriterConfig.OpenMode.CREATE) {
        writer.addDocument(lucenedoc);
    } else {
        writer.updateDocument(new Term(PATH, innerPath), lucenedoc);
    }
    executor.shutdown();
}
My question is:
While the docImport method is running, 5 threads read files and each thread tries to index its files into the same Lucene index.
So at intervals this error occurs: "org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@C:\lucene\index\write.lock"
For example, sometimes only 30 of 100 files get indexed; the others are not indexed because of the error.
How can I resolve this error? How can I handle this?
You're getting this error because you attempt to open an IndexWriter while there is already a writer open on the index.
In addition to that issue, opening a new IndexWriter is a very expensive operation. Even if you were to get it working (say, by synchronizing a block that opens, uses, and then closes the IndexWriter), it would likely be quite slow.
Instead, open one IndexWriter, keep it open, and share it across all of the threads.
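A rough sketch of that approach, written against the code in the question (zipFile, fileList, dir and LUCENE_VERSION are taken from it; buildDocument is a hypothetical helper that creates the Lucene Document for one file; imports are omitted as in the question):
public void docImport() throws Exception
{
    // One IndexWriter for the whole import. IndexWriter is thread-safe,
    // so every worker thread can add documents through the same instance.
    Analyzer analyzer = new StandardAnalyzer(LUCENE_VERSION);
    IndexWriterConfig iwc = new IndexWriterConfig(LUCENE_VERSION, analyzer);
    iwc.setRAMBufferSizeMB(200);
    iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
    final IndexWriter writer = new IndexWriter(dir, iwc);

    ExecutorService executor = Executors.newFixedThreadPool(5);
    for (final File file : fileList)
    {
        executor.execute(new Runnable() {
            @Override
            public void run() {
                try {
                    zipFile(file);                            // as in the question
                    writer.addDocument(buildDocument(file));  // hypothetical helper building the Document
                } catch (Exception e) {
                    // log and keep going so one bad file does not stop the import
                }
            }
        });
    }
    executor.shutdown();
    executor.awaitTermination(1, TimeUnit.HOURS); // wait for all tasks to finish
    writer.close(); // commits and releases the write lock
}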

How to chain multiple different InputStreams into one InputStream

I'm wondering if there is any idiomatic way to chain multiple InputStreams into one continuous InputStream in Java (or Scala).
What I need it for is to parse flat files that I load over the network from an FTP server. What I want to do is take file[1..N], open streams, and then combine them into one stream, so that when file1 comes to an end I start reading from file2, and so on, until I reach the end of fileN.
I need to read these files in a specific order; the data comes from a legacy system that produces files in batches, so data in one file depends on data in another file, but I would like to handle them as one continuous stream to simplify my domain logic interface.
I searched around and found PipedInputStream, but I'm not positive that is what I need. An example would be helpful.
It's right there in JDK! Quoting JavaDoc of SequenceInputStream:
A SequenceInputStream represents the logical concatenation of other input streams. It starts out with an ordered collection of input streams and reads from the first one until end of file is reached, whereupon it reads from the second one, and so on, until end of file is reached on the last of the contained input streams.
You want to concatenate an arbitrary number of InputStreams, while that constructor of SequenceInputStream accepts only two. But since SequenceInputStream is also an InputStream, you can apply it recursively (nest them):
new SequenceInputStream(
    new SequenceInputStream(
        new SequenceInputStream(file1, file2),
        file3
    ),
    file4
);
...you get the idea.
See also
How do you merge two input streams in Java? (dup?)
This is done using SequenceInputStream, which is straightforward in Java, as Tomasz Nurkiewicz's answer shows. I had to do this repeatedly in a project recently, so I added some Scala-y goodness via the "pimp my library" pattern.
import java.io.{InputStream, SequenceInputStream}

object StreamUtils {
  implicit def toRichInputStream(str: InputStream) = new RichInputStream(str)

  class RichInputStream(str: InputStream) {
    // a bunch of other handy Stream functionality, deleted
    def ++(str2: InputStream): InputStream = new SequenceInputStream(str, str2)
  }
}
With that, I can do stream sequencing as follows:
val mergedStream = stream1 ++ stream2 ++ stream3
or even
val streamList = //some arbitrary-length list of streams, non-empty
val mergedStream = streamList.reduceLeft(_++_)
Another solution: first create a list of input streams and then create the SequenceInputStream from it:
List<InputStream> iss = Files.list(Paths.get("/your/path"))
        .filter(Files::isRegularFile)
        .map(f -> {
            try {
                return new FileInputStream(f.toString());
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }).collect(Collectors.toList());

InputStream is = new SequenceInputStream(Collections.enumeration(iss));
Here is a more elegant solution using Vector. This example is for Android specifically, but you can use a Vector the same way in any Java code:
AssetManager am = getAssets();
Vector<InputStream> v = new Vector<InputStream>(Constant.PAGES);
for (int i = 0; i < Constant.PAGES; i++) {
    String fileName = "file" + i + ".txt";
    InputStream is = am.open(fileName);
    v.add(is);
}
Enumeration<InputStream> e = v.elements();
SequenceInputStream sis = new SequenceInputStream(e);
InputStreamReader isr = new InputStreamReader(sis);
Scanner scanner = new Scanner(isr); // or use a BufferedReader
Here's a simple Scala version that concatenates an Iterator[InputStream]:
import java.io.{InputStream, SequenceInputStream}
import scala.collection.JavaConverters._

def concatInputStreams(streams: Iterator[InputStream]): InputStream =
  new SequenceInputStream(streams.asJavaEnumeration)

Lucene Java opening too many files. Am I using IndexWriter properly?

My Lucene Java implementation is eating up too many file handles. I followed the instructions in the Lucene Wiki about too many open files, but that only slowed the problem down. Here is my code to add objects (PTicket) to the index:
//This gets called when the bean is instantiated
public void initializeIndex() {
    analyzer = new WhitespaceAnalyzer(Version.LUCENE_32);
    config = new IndexWriterConfig(Version.LUCENE_32, analyzer);
}

public void addAllToIndex(Collection<PTicket> records) {
    IndexWriter indexWriter = null;
    config = new IndexWriterConfig(Version.LUCENE_32, analyzer);
    try {
        indexWriter = new IndexWriter(directory, config);
        for (PTicket record : records) {
            Document doc = new Document();
            StringBuffer documentText = new StringBuffer();
            doc.add(new Field("_id", record.getIdAsString(), Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("_type", record.getType(), Field.Store.YES, Field.Index.ANALYZED));
            for (String key : record.getProps().keySet()) {
                List<String> vals = record.getProps().get(key);
                for (String val : vals) {
                    addToDocument(doc, key, val);
                    documentText.append(val).append(" ");
                }
            }
            addToDocument(doc, DOC_TEXT, documentText.toString());
            indexWriter.addDocument(doc);
        }
        indexWriter.optimize();
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        cleanup(indexWriter);
    }
}

private void cleanup(IndexWriter iw) {
    if (iw == null) {
        return;
    }
    try {
        iw.close();
    } catch (IOException ioe) {
        logger.error("Error trying to close index writer");
        logger.error("{}", ioe.getClass().getName());
        logger.error("{}", ioe.getMessage());
    }
}

private void addToDocument(Document doc, String field, String value) {
    doc.add(new Field(field, value, Field.Store.YES, Field.Index.ANALYZED));
}
EDIT TO ADD code for searching
public Set<Object> searchIndex(AthenaSearch search) {
    try {
        Query q = new QueryParser(Version.LUCENE_32, DOC_TEXT, analyzer).parse(query);
        //searcher is actually instantiated in initialization. Lucene recommends this.
        //IndexSearcher searcher = new IndexSearcher(directory, true);
        TopDocs topDocs = searcher.search(q, numResults);
        ScoreDoc[] hits = topDocs.scoreDocs;
        for (int i = start; i < hits.length; ++i) {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            ids.add(d.get("_id"));
        }
        return ids;
    } catch (Exception e) {
        e.printStackTrace();
        return null;
    }
}
This code is in a web application.
1) Is this the advised way to use IndexWriter (instantiating a new one on each add to index)?
2) I've read that raising ulimit will help, but that just seems like a band-aid that won't address the actual problem.
3) Could the problem lie with IndexSearcher?
1) Is this the advised way to use IndexWriter (instantiating a new one on each add to index)?
I advise no. There are constructors that will check whether an index already exists in the directory and either create a new one or append to it. Problem 2 would be solved if you reuse the IndexWriter.
EDIT:
OK, it seems that in Lucene 3.2 all but one of the constructors are deprecated, so reuse of the IndexWriter can be achieved by using the IndexWriterConfig.OpenMode enum with the value CREATE_OR_APPEND.
Also, opening a new writer and closing it on each document add is not efficient; I suggest reusing it. If you want to speed up indexing, tune setRAMBufferSizeMB (the default is 16 MB) by trial and error.
From the docs:
Note that you can open an index with create=true even while readers are using the index. The old readers will continue to search the "point in time" snapshot they had opened, and won't see the newly created index until they re-open.
Also reuse the IndexSearcher. I cannot see the code for searching, but IndexSearcher is thread-safe and can be used read-only as well.
I also suggest setting the merge factor on the writer. This is not required, but it will help limit the number of inverted index files created; tune it by trial and error as well.
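For example, a sketch of such a configuration against the Lucene 3.2 API used in the question (directory and analyzer as in the question; the numbers are placeholders to be tuned by trial and error):
Analyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_32);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_32, analyzer);
config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND); // append instead of recreating the index
config.setRAMBufferSizeMB(64.0);                                 // default is 16 MB; tune by trial and error

LogByteSizeMergePolicy mergePolicy = new LogByteSizeMergePolicy();
mergePolicy.setMergeFactor(10); // lower values mean fewer segments but more merging work
config.setMergePolicy(mergePolicy);

IndexWriter writer = new IndexWriter(directory, config); // create once and reuse for all adds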
I think we'd need to see your search code to be sure, but I'd suspect that it is a problem with the index searcher. More specifically, make sure that your index reader is being properly closed when you've finished with it.
Good luck,
The scientifically correct answer would be: you can't really tell from this fragment of code.
The more constructive answer would be:
You have to make sure that only one IndexWriter is writing to the index at any given time, and you therefore need some mechanism to ensure that. So my answer depends on what you want to accomplish:
do you want a deeper understanding of Lucene? or..
do you just want to build and use an index?
If your answer is the latter, you probably want to look at projects like Solr, which hides all the index reading and writing.
This question is probably a duplicate of
Too many open files Error on Lucene
I am repeating here my answer for that.
Use the compound file format to reduce the file count. When this flag is set, Lucene will write each segment as a single .cfs file instead of multiple files. This reduces the number of files significantly.
IndexWriter.setUseCompoundFile(true)
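Depending on your Lucene version, this setter may live on the merge policy rather than on IndexWriter; a small sketch for a 3.x-style setup, where config is the IndexWriterConfig used to create the writer:
LogMergePolicy mergePolicy = new LogByteSizeMergePolicy();
mergePolicy.setUseCompoundFile(true); // write each segment as a single .cfs file
config.setMergePolicy(mergePolicy);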

Writing in the beginning of a text file Java

I need to write something at the beginning of a text file. I have a text file with content and I want to write something before this content. Say I have:
Good afternoon sir,how are you today?
I'm fine,how are you?
Thanks for asking,I'm great
After modifying it, I want it to be like this:
Page 1-Scene 59
25.05.2011
Good afternoon sir,how are you today?
I'm fine,how are you?
Thanks for asking,I'm great
I just made up the content :) How can I modify a text file this way?
You can't really modify it that way - file systems don't generally let you insert data at arbitrary locations - but you can (a minimal sketch of these steps follows the list):
Create a new file
Write the prefix to it
Copy the data from the old file to the new file
Move the old file to a backup location
Move the new file to the old file's location
Optionally delete the old backup file
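A minimal plain-JDK sketch of those steps; prepend is a hypothetical helper, and Java 7+ is assumed for java.nio.file:
import java.io.*;
import java.nio.file.*;

// Writes "prefix" followed by the old content of "original" into a temporary
// file, then swaps the files, following the steps listed above.
static void prepend(File original, String prefix) throws IOException {
    File temp = File.createTempFile("prepend", ".tmp", original.getParentFile());
    try (OutputStream out = new FileOutputStream(temp)) {
        out.write(prefix.getBytes("UTF-8"));       // 1-2: new file starting with the prefix
        Files.copy(original.toPath(), out);        // 3: append the old content
    }
    File backup = new File(original.getPath() + ".bak");
    Files.move(original.toPath(), backup.toPath(), // 4: keep the old file as a backup
               StandardCopyOption.REPLACE_EXISTING);
    Files.move(temp.toPath(), original.toPath());  // 5: move the new file into place
    Files.deleteIfExists(backup.toPath());         // 6: optional cleanup
}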
Just in case it is useful for someone, here is the full source code of a method that prepends lines to a file using the Apache Commons IO library. The code does not read the whole file into memory, so it will work on files of any size.
public static void prependPrefix(File input, String prefix) throws IOException {
    LineIterator li = FileUtils.lineIterator(input);
    File tempFile = File.createTempFile("prependPrefix", ".tmp");
    BufferedWriter w = new BufferedWriter(new FileWriter(tempFile));
    try {
        w.write(prefix);
        while (li.hasNext()) {
            w.write(li.next());
            w.write("\n");
        }
    } finally {
        IOUtils.closeQuietly(w);
        LineIterator.closeQuietly(li);
    }
    FileUtils.deleteQuietly(input);
    FileUtils.moveFile(tempFile, input);
}
I think what you want is random access. Check out the related Java tutorial. However, I don't believe you can just insert data at an arbitrary point in the file; if I recall correctly, you would only overwrite the data. If you wanted to insert, you'd have to have your code:
1. copy a block,
2. overwrite it with your new stuff,
3. copy the next block,
4. overwrite it with the previously copied block,
5. return to step 3 until there are no more blocks
As @atk suggested, java.nio.channels.SeekableByteChannel is a good interface, but it is only available from Java 1.7 onwards.
Update: If you have no issue using FileUtils, then use
String fileString = FileUtils.readFileToString(file);
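A sketch of that approach, which is fine for small files since the whole content is held in memory (Commons IO assumed on the classpath; the header text is the example from the question):
String fileString = FileUtils.readFileToString(file, "UTF-8");
FileUtils.writeStringToFile(file, "Page 1-Scene 59\n25.05.2011\n" + fileString, "UTF-8");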
This isn't a direct answer to the question, but often files are accessed via InputStreams. If this is your use case, then you can chain input streams via SequenceInputStream to achieve the same result. E.g.
InputStream inputStream = new SequenceInputStream(
        new ByteArrayInputStream("my line\n".getBytes()),
        new FileInputStream(new File("myfile.txt")));
I will leave this here just in case anyone needs it.
// Prepends the content of fileName2 to fileName1, buffering everything in memory.
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
try (FileInputStream fileInputStream1 = new FileInputStream(fileName1);
     FileInputStream fileInputStream2 = new FileInputStream(fileName2)) {
    while (fileInputStream2.available() > 0) {
        byteArrayOutputStream.write(fileInputStream2.read());
    }
    while (fileInputStream1.available() > 0) {
        byteArrayOutputStream.write(fileInputStream1.read());
    }
}
try (FileOutputStream fileOutputStream = new FileOutputStream(fileName1)) {
    byteArrayOutputStream.writeTo(fileOutputStream);
}

how to delete documents using term in lucene

I am trying to delete a document using a term in a Lucene index, but the code I wrote below isn't working. Are there any suggestions for how I can delete documents from a Lucene index?
public class DocumentDelete {
    public static void main(String[] args) throws Exception {
        File indexDir = new File("C:/Users/Raden/Documents/lucene/LuceneHibernate/adi");
        Term term = new Term(FIELD_PATH, "compatible");
        Directory directory = FSDirectory.getDirectory(indexDir);
        IndexReader indexReader = IndexReader.open(directory);
        indexReader.deleteDocuments(term);
        indexReader.close();
    }
}
IndexReader indexReader = IndexReader.open(directory); // this one uses the default read-only mode
Instead, use this:
IndexReader indexReader = IndexReader.open(directory, false); // this opens the index in edit mode, so you can delete from it...
So you do not need any extra tool for deleting index contents...
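Putting it together, a sketch of the corrected snippet from the question against a Lucene 3.x-style API (FIELD_PATH is the field constant from the question; FSDirectory.open replaces the older FSDirectory.getDirectory):
File indexDir = new File("C:/Users/Raden/Documents/lucene/LuceneHibernate/adi");
Directory directory = FSDirectory.open(indexDir);
IndexReader indexReader = IndexReader.open(directory, false);    // false = writable, so deletes are allowed
indexReader.deleteDocuments(new Term(FIELD_PATH, "compatible")); // delete every document matching the term
indexReader.close();                                             // closing the reader commits the deletions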
