Executor service in executor service? - java

In a document import method I work with a large number of files, each of which can be 100 MB-200 MB. I want to process them asynchronously with threads. In a for loop, each file is zipped and then indexed (Lucene). This operation is very costly and too slow for real time, and the total operation must not stop.
General structure of import method is given below:
public void docImport()
{
    ExecutorService executor = Executors.newFixedThreadPool(5);
    for (final File file : fileList)
    {
        // Do some work...
        executor.execute(new Runnable() {
            @Override
            public void run() {
                zipFile(file);   // Each zipped file has a different name, same directory.
                indexFile(file); // Each file is indexed into the same directory.
            }
        });
    }
    executor.shutdown();
}
General structure of the indexFile method:
public void indexFile()
{
    ExecutorService executor = Executors.newFixedThreadPool(1);
    IndexWriter writer = null;
    Directory dir = .....;
    Analyzer analyzer = new StandardAnalyzer(LUCENE_VERSION);
    IndexWriterConfig iwc = new IndexWriterConfig(LUCENE_VERSION, analyzer);
    iwc.setRAMBufferSizeMB(200);
    iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
    writer = new IndexWriter(dir, iwc);

    Document lucenedoc = new Document();
    lucenedoc.add(..);
    if (writer.getConfig().getOpenMode() == IndexWriterConfig.OpenMode.CREATE) {
        writer.addDocument(lucenedoc);
    } else {
        writer.updateDocument(new Term(PATH, innerPath), lucenedoc);
    }
    executor.shutdown();
}
My question is:
While the docImport method is running, 5 threads read files and each thread tries to index its file into the same Lucene index.
So the error occurs at some point: "org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock#C:\lucene\index\write.lock"
For example, sometimes only 30 out of 100 files get indexed; the others are not indexed because of the error.
How can I resolve this error? How should I handle this?

You're getting this error because you attempt to open an IndexWriter while there is already a writer open on the same index.
In addition to that issue, opening a new IndexWriter is a very expensive operation. Even if you were to get it working (say, by synchronizing a block which opens, uses and then closes the IndexWriter), it would likely be quite slow.
Instead, open one IndexWriter, keep it open, and share it across each of the threads.
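A minimal sketch of that idea, reusing the names from the question (fileList, zipFile, LUCENE_VERSION); buildDocument(File) is a hypothetical helper that builds the Lucene Document, and the index path is taken from the error message:

public void docImport() throws IOException, InterruptedException {
    // One writer for the whole import; IndexWriter is thread-safe.
    Analyzer analyzer = new StandardAnalyzer(LUCENE_VERSION);
    IndexWriterConfig iwc = new IndexWriterConfig(LUCENE_VERSION, analyzer);
    iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
    final IndexWriter writer = new IndexWriter(FSDirectory.open(new File("C:/lucene/index")), iwc);

    ExecutorService executor = Executors.newFixedThreadPool(5);
    for (final File file : fileList) {
        executor.execute(new Runnable() {
            @Override
            public void run() {
                try {
                    zipFile(file);
                    writer.addDocument(buildDocument(file)); // all threads share the same writer
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
    }
    executor.shutdown();
    executor.awaitTermination(1, TimeUnit.HOURS); // wait for the workers to finish
    writer.close(); // commit and close once, at the very end
}

The only structural change is that the writer is created once before the pool starts and closed once after awaitTermination; the per-file work only calls addDocument (or updateDocument) on the shared instance.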

Related

Slow operations in parallel

I need help with running parallel operations. The goal of the code is to extract a large number of small files from the same tar archive into different folders in a very short time.
This is the code:
public void decompress(File archive, File destination) throws RuntimeException {
    try (InputStream in = new FileInputStream(archive);
         BufferedInputStream buff = new BufferedInputStream(in);
         TarArchiveInputStream is = (TarArchiveInputStream) new ArchiveStreamFactory().createArchiveInputStream("tar", buff)
    ) {
        TarArchiveEntry entry;
        while ((entry = is.getNextTarEntry()) != null) {
            File file = new File(destination, entry.getName());
            file.getParentFile().mkdirs();
            Files.write(file.toPath(), is.readAllBytes());
        }
    } catch (IOException | ArchiveException e) {
        e.printStackTrace();
    }
}
When I execute this operation once, it takes ~900 ms.
But when I do something like the following to execute the same operation multiple times in parallel, it takes 20000 ms:
ExecutorService EXECUTOR_SERVICE = Executors.newFixedThreadPool(20);
File archive = ...;
for (int i = 0; i < 5; i++) {
    File directory = new File("Dir_" + i);
    EXECUTOR_SERVICE.submit(() -> decompress(archive, directory));
}
or
File archive = ...;
for (int i = 0; i < 5; i++) {
    File directory = new File("Dir_" + i);
    new Thread(() -> decompress(archive, directory)).start();
}
One suspicion is that the directories contain many files, so File.mkdirs does needlessly many checks.
The constructor of BufferedInputStream can take a custom buffer size. That has never helped much in my experience, but it might with your disk. With parallelism it could also help to reduce disk head movements.
You probably already tried Files.copy, but it may still have better memory behavior than readAllBytes.
So the version becomes (eschewing File in favor of Path):
public void decompress(File archive, File destination) throws RuntimeException {
    final int bufferSize = 1024 * 128;
    Path archivePath = archive.toPath();
    Path destinationPath = destination.toPath();
    try (InputStream in = Files.newInputStream(archivePath);
         BufferedInputStream buff = new BufferedInputStream(in, bufferSize);
         TarArchiveInputStream is = (TarArchiveInputStream)
                 new ArchiveStreamFactory().createArchiveInputStream("tar", buff)
    ) {
        Path oldFileParent = destinationPath;
        Files.createDirectories(oldFileParent);
        TarArchiveEntry entry;
        while ((entry = is.getNextTarEntry()) != null) {
            Path file = destinationPath.resolve(entry.getName());
            Path fileParent = file.getParent();
            if (!fileParent.equals(oldFileParent)) {
                oldFileParent = fileParent;
                Files.createDirectories(oldFileParent);
            }
            Files.copy(is, file);
            //Files.write(file, is.readAllBytes());
        }
    } catch (IOException | ArchiveException e) {
        e.printStackTrace();
    }
}
Throwing a RuntimeException and catching the IOException/ArchiveException without rethrowing it (for example as new IllegalStateException(e)) is a matter of taste.
Now to adding parallelism: disk output is probably the bottleneck. Writing two files to the same disk in parallel means skipping back and forth on the disk; with small files this may still be acceptable.
It seems better to parallelize differently: read the next file in one thread and write it in another.
Two threads might perform better than many threads generating heavy disk traffic. readAllBytes would then be appropriate, so that the writing thread does not use is itself.
Since the tar entry probably also carries the file size, you could check whether readAllBytes is efficient enough even for large files.
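A rough sketch of that reader/writer split, using a bounded BlockingQueue; the PendingFile holder, the queue size, and the poison-pill shutdown are my own additions, not from the question (the record syntax assumes Java 16+):

// Holder handed from the reader thread to the writer thread.
record PendingFile(Path target, byte[] data) {}

void decompressPipelined(File archive, File destination) throws Exception {
    BlockingQueue<PendingFile> queue = new ArrayBlockingQueue<>(64);
    PendingFile POISON = new PendingFile(null, null); // end-of-stream marker

    Thread writerThread = new Thread(() -> {
        try {
            for (PendingFile p = queue.take(); p != POISON; p = queue.take()) {
                Files.createDirectories(p.target().getParent());
                Files.write(p.target(), p.data());
            }
        } catch (IOException | InterruptedException e) {
            e.printStackTrace();
        }
    });
    writerThread.start();

    try (TarArchiveInputStream is = (TarArchiveInputStream) new ArchiveStreamFactory()
            .createArchiveInputStream("tar", new BufferedInputStream(new FileInputStream(archive)))) {
        TarArchiveEntry entry;
        while ((entry = is.getNextTarEntry()) != null) {
            if (entry.isDirectory()) continue;
            // readAllBytes keeps the writer thread independent of the tar stream
            queue.put(new PendingFile(destination.toPath().resolve(entry.getName()), is.readAllBytes()));
        }
    } finally {
        queue.put(POISON);  // always unblock and stop the writer
        writerThread.join();
    }
}

The reader stays on the tar stream the whole time, while the writer only touches the destination disk, so the two kinds of I/O no longer interleave on every entry.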
Logging was mentioned in this question. It is known that logging can consume a lot of time, and with parallelism it becomes even more critical. But you seem to be aware of that: you wrote that you have written your own logger. For a library, System.Logger is actually best; it is a facade that uses whatever logger the application provides. This would also have prevented the logger vulnerability hidden in library dependencies of the past year.
Ignoring the fact that you are not decompressing the file in parallel here (you are running multiple threads decompressing the same file concurrently, essentially overwriting the result), there may be several reasons for this performance hit. I/O is one, so it depends on the underlying implementation. Also, what Logger are you using there? While the other parts of your code don't seem to be shared among multiple threads, the static call to the Logger is.
Also note: java.nio uses FileChannels, which provide synchronous I/O, so depending on how you create the channels you may get into similar situations (though I don't believe that applies here).

Java - read a file using executor service gives error 409

How can I read the same file twice, in different tasks, using an executor service?
I used the sample code structure below, but reading the same file simultaneously gives a 409 error. How can I resolve it?
//Sample code
ExecutorService ee = Executors.newFixedThreadPool(2);

Callable<Object> c1 = () -> {
    try (InputStream in = new URL("https://server.com/file1").openStream();
         BufferedReader br = new BufferedReader(new InputStreamReader(in))) {
        br.lines().forEach(System.out::println);
    }
    return null;
};

Callable<Object> c2 = () -> {
    try (InputStream in = new URL("https://server.com/file1").openStream();
         BufferedReader br = new BufferedReader(new InputStreamReader(in))) {
        br.lines().forEach(System.out::println);
    }
    return null;
};

List<Callable<Object>> tasks = new ArrayList<>();
tasks.add(c1);
tasks.add(c2);
ee.invokeAll(tasks);
ee.shutdown();
Each task works fine individually.
But when both tasks are run at the same time, both try to access the same file and fail with a 409 Conflict error.
How can I resolve it?
Note: I need to fetch the same file simultaneously from different tasks.
I don't want to read the file once and store the content in a list, then use that list for further processing.

Write multiple files with same string without hanging the UI

I am working on an Android app that changes the CPU frequency when the foreground app changes. The frequencies for each foreground app are defined in my application itself. But while changing the frequencies my app has to open multiple system files and replace the frequency with my text. This makes my UI slow, and when I switch apps continuously it makes the SystemUI crash. What can I do to write these multiple files all together at the same time?
I have tried using AsyncTaskLoader, but that too crashes the SystemUI later.
public static boolean setFreq(String max_freq, String min_freq) {
    ByteArrayInputStream inputStream = new ByteArrayInputStream(max_freq.getBytes(Charset.forName("UTF-8")));
    ByteArrayInputStream inputStream1 = new ByteArrayInputStream(min_freq.getBytes(Charset.forName("UTF-8")));
    SuFileOutputStream outputStream;
    SuFileOutputStream outputStream1;
    try {
        if (max_freq != null) {
            int cpus = 0;
            while (true) {
                SuFile f = new SuFile(CPUActivity.MAX_FREQ_PATH.replace("cpu0", "cpu" + cpus));
                SuFile f1 = new SuFile(CPUActivity.MIN_FREQ_PATH.replace("cpu0", "cpu" + cpus));
                outputStream = new SuFileOutputStream(f);
                outputStream1 = new SuFileOutputStream(f1);
                ShellUtils.pump(inputStream, outputStream);
                ShellUtils.pump(inputStream1, outputStream1);
                if (!f.exists()) {
                    break;
                }
                cpus++;
            }
        }
    } catch (Exception ex) {
    }
    return true;
}
I assume SuFile and SuFileOutputStream are your custom implementations extending the Java File and FileOutputStream classes.
A couple of points need to be fixed first.
The f.exists() check should come before initializing the OutputStream; otherwise the stream creates the file before the check, so your while loop becomes an infinite loop.
As @Daryll suggested, determine the number of CPUs and use it in the loop; I suggest a for loop.
Close your streams after the pump(..) calls.
If you want to keep the main thread free, you can do something like the following code segment:
public static void setFreq(final String max_freq, final String min_freq) {
    new Thread(new Runnable() {
        @Override
        public void run() {
            // Put all the stuff here
        }
    }).start();
}
This should solve your problem.
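Putting those fixes together, a sketch might look like the following; it assumes SuFile/SuFileOutputStream behave like File/FileOutputStream (including being closeable in try-with-resources), and getCpuCount() is a hypothetical helper, e.g. backed by Runtime.getRuntime().availableProcessors():

public static void setFreqAsync(final String maxFreq, final String minFreq) {
    new Thread(new Runnable() {
        @Override
        public void run() {
            int cpus = getCpuCount(); // hypothetical helper
            for (int cpu = 0; cpu < cpus; cpu++) {
                SuFile max = new SuFile(CPUActivity.MAX_FREQ_PATH.replace("cpu0", "cpu" + cpu));
                SuFile min = new SuFile(CPUActivity.MIN_FREQ_PATH.replace("cpu0", "cpu" + cpu));
                if (!max.exists() || !min.exists()) {
                    continue; // check before opening streams, so nothing is created accidentally
                }
                try (SuFileOutputStream maxOut = new SuFileOutputStream(max);
                     SuFileOutputStream minOut = new SuFileOutputStream(min)) {
                    // fresh input streams per file, so every CPU gets the full bytes
                    ShellUtils.pump(new ByteArrayInputStream(maxFreq.getBytes(Charset.forName("UTF-8"))), maxOut);
                    ShellUtils.pump(new ByteArrayInputStream(minFreq.getBytes(Charset.forName("UTF-8"))), minOut);
                } catch (Exception ex) {
                    // log and continue with the next CPU
                }
            }
        }
    }).start();
}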
Determine the number of CPUs beforehand and use that number in your loop, rather than using while (true) and having to call SuFile.exists() every cycle.
I don't know what SuFileOutputStream is, but you may need to close those file output streams, or find a faster way to write the files if that implementation is too slow.

Liferay Concurrent FileEntry Upload

Problem Statement:
In Liferay I have to import a zip file into a folder in the Liferay CMS. So far I had implemented serial unzipping of the zip file, creating its folders and then its files. The problem here is that the whole process takes a lot of time, so I had to use a parallel approach for creating the folders and files.
My Solution:
I used a java.util.concurrent.ExecutorService created with Executors.newFixedThreadPool(NTHREDS), where NTHREDS is the number of threads to run in parallel (say 5).
I read all the folder paths from the zip and placed the list of zip entries (files) against the folder path as a key in a HashMap.
I traversed all keys in the map and created the folders serially.
Then I traversed the lists of zip entries (files) from the map and passed them to worker threads, one file per worker; these workers are then sent to the ExecutorService to execute.
So far I haven't found any significant reduction in the time of the whole process. Am I moving in the correct direction? Does Liferay support concurrent file addition? What am I doing wrong?
I will be much thankful for any help in this regard.
Below is my code:
imports
...
...

public class TestImportZip {

    private static final int NTHREDS = 5;
    ExecutorService executor = null;
    ...
    ...
    ....
    Map<String,Folder> folders = new HashMap<String,Folder>();
    File zipsFile = null;

    public TestImportZip(............, File zipFile, .) {
        .
        .
        this.zipsFile = zipFile;
        this.executor = Executors.newFixedThreadPool(NTHREDS);
    }

    // From here the process starts
    public void importZip() {
        Map<String,List<ZipEntry>> foldersMap = new HashMap<String, List<ZipEntry>>();
        try (ZipFile zipFile = new ZipFile(zipsFile)) {
            zipFile.stream().forEach(entry -> {
                String entryName = entry.getName();
                if (entryName.contains("/")) {
                    String key = entryName.substring(0, entryName.lastIndexOf("/"));
                    List<ZipEntry> zipEntries = foldersMap.get(key);
                    if (zipEntries == null) {
                        zipEntries = new ArrayList<>();
                    }
                    zipEntries.add(entry);
                    foldersMap.put(key, zipEntries);
                }
            });
            createFolders(foldersMap.keySet());
            createFiles(foldersMap);
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

    private void createFolders(Set<String> folderPathSets) {
        // create folder and put the folder in map
        .
        .
        .
        folders.put(folderPath, folder);
    }

    private void createFiles(Map<String, List<ZipEntry>> foldersMap) {
        .
        .
        .
        // Traverse all the files from all the lists in the map and send them to a worker
        createFileWorker(folderPath, zipEntry);
    }

    private void createFileWorker(String folderPath, ZipEntry zipEntry) {
        CreateEntriesWorker cfw = new CreateEntriesWorker(folderPath, zipEntry);
        executor.execute(cfw);
    }

    class CreateEntriesWorker implements Runnable {
        Folder folder = null;
        ZipEntry entryToCreate = null;

        public CreateEntriesWorker(String folderPath, ZipEntry zipEntry) {
            this.entryToCreate = zipEntry;
            // get folder from already created folder map
            this.folder = folders.get(folderPath);
        }

        public void run() {
            if (this.folder != null) {
                long startTime = System.currentTimeMillis();
                try (ZipFile zipFile = new ZipFile(zipsFile)) {
                    InputStream inputStream = zipFile.getInputStream(entryToCreate);
                    try {
                        String name = entryToCreate.getName();
                        // created file entry here
                    } catch (Exception e) {
                    } finally {
                        if (inputStream != null)
                            inputStream.close();
                    }
                } catch (IOException e) {
                    // TODO Auto-generated catch block
                    e.printStackTrace();
                }
            }
        }
    }
}
Your simplified code does not contain any Liferay reference that I recognize. The description you provide hints that you're trying to optimize some code but aren't getting any better performance out of your attempt. That typically is a sign that you're trying to optimize the wrong aspect of the problem (or that it's already quite optimized).
You'll need to determine the actual bottleneck of your operation in order to know if it's feasible to optimize. There's a common saying that "premature optimization is the root of all evil". What does it mean?
I'll completely make up numbers here - don't quote me on them: they're freely invented for illustration purposes. Let's say that your operation of adding the contents of a zip file to Liferay's repository is distributed across the following percentages of operational resources:
4% zip file decoding/decompressing
6% file I/O for zip operations and temporary files
10% database operation for storing the files
60% for extracting text-only from word, pdf, excel and other files stored within the zip file in order to index the document in the full-text index
20% overhead of the full-text indexing library for putting together the index.
Suppose you're optimizing the zip file decoding/decompressing: what overall improvement can you expect? Even if you made that step infinitely fast, you'd save at most 4% of the total time.
While my numbers are made up: if your optimizations show no result, I'd recommend reversing them, measuring where you actually need to optimize, and going after that place (or accepting it and upgrading your hardware if that place is out of reach).
Run those numbers for CPU, I/O, memory and other potential bottlenecks. Identify your actual bottleneck #1, fix it, and measure again. You'll see that bottleneck #2 has gotten a promotion. Rinse and repeat until you're happy.

Lucene Java opening too many files. Am I using IndexWriter properly?

My Lucene Java implementation is eating up too many files. I followed the instructions in the Lucene wiki about too many open files, but that only slowed the problem down. Here is my code to add objects (PTicket) to the index:
//This gets called when the bean is instantiated
public void initializeIndex() {
    analyzer = new WhitespaceAnalyzer(Version.LUCENE_32);
    config = new IndexWriterConfig(Version.LUCENE_32, analyzer);
}

public void addAllToIndex(Collection<PTicket> records) {
    IndexWriter indexWriter = null;
    config = new IndexWriterConfig(Version.LUCENE_32, analyzer);
    try {
        indexWriter = new IndexWriter(directory, config);
        for (PTicket record : records) {
            Document doc = new Document();
            StringBuffer documentText = new StringBuffer();
            doc.add(new Field("_id", record.getIdAsString(), Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("_type", record.getType(), Field.Store.YES, Field.Index.ANALYZED));
            for (String key : record.getProps().keySet()) {
                List<String> vals = record.getProps().get(key);
                for (String val : vals) {
                    addToDocument(doc, key, val);
                    documentText.append(val).append(" ");
                }
            }
            addToDocument(doc, DOC_TEXT, documentText.toString());
            indexWriter.addDocument(doc);
        }
        indexWriter.optimize();
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        cleanup(indexWriter);
    }
}

private void cleanup(IndexWriter iw) {
    if (iw == null) {
        return;
    }
    try {
        iw.close();
    } catch (IOException ioe) {
        logger.error("Error trying to close index writer");
        logger.error("{}", ioe.getClass().getName());
        logger.error("{}", ioe.getMessage());
    }
}

private void addToDocument(Document doc, String field, String value) {
    doc.add(new Field(field, value, Field.Store.YES, Field.Index.ANALYZED));
}
EDIT TO ADD code for searching
public Set<Object> searchIndex(AthenaSearch search) {
    try {
        Query q = new QueryParser(Version.LUCENE_32, DOC_TEXT, analyzer).parse(query);
        //searcher is actually instantiated in initialization. Lucene recommends this.
        //IndexSearcher searcher = new IndexSearcher(directory, true);
        TopDocs topDocs = searcher.search(q, numResults);
        ScoreDoc[] hits = topDocs.scoreDocs;
        for (int i = start; i < hits.length; ++i) {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            ids.add(d.get("_id"));
        }
        return ids;
    } catch (Exception e) {
        e.printStackTrace();
        return null;
    }
}
This code is in a web application.
1) Is this the advised way to use IndexWriter (instantiating a new one on each add to index)?
2) I've read that raising ulimit will help, but that just seems like a band-aid that won't address the actual problem.
3) Could the problem lie with IndexSearcher?
1) Is this the advised way to use IndexWriter (instantiating a new one on each add to index)?
I advise no. There are constructors that will check whether an index already exists in the directory and either open it or create a new writer accordingly. Problem 2 would also be solved if you reuse the IndexWriter.
EDIT:
OK, it seems that in Lucene 3.2 all but one of the constructors are deprecated, so reuse of the IndexWriter can be achieved by using the enum IndexWriterConfig.OpenMode with the value CREATE_OR_APPEND.
Also, opening a new writer and closing it on each document add is not efficient. I suggest reusing it; if you want to speed up indexing, set the RAM buffer size via setRAMBufferSizeMB (the default value is 16 MB) and tune it by trial and error.
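A minimal sketch of that reuse, adapted to the code in the question; it assumes the directory and analyzer fields from the question and keeps the writer as a field instead of a local:

// Open the writer once, when the bean is instantiated, and keep it.
public void initializeIndex() throws IOException {
    analyzer = new WhitespaceAnalyzer(Version.LUCENE_32);
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_32, analyzer);
    config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
    config.setRAMBufferSizeMB(48.0); // tune by trial and error; the default is 16 MB
    indexWriter = new IndexWriter(directory, config); // field, reused by every call
}

public void addAllToIndex(Collection<PTicket> records) throws IOException {
    for (PTicket record : records) {
        Document doc = new Document();
        doc.add(new Field("_id", record.getIdAsString(), Field.Store.YES, Field.Index.ANALYZED));
        // ... build the remaining fields exactly as before ...
        indexWriter.addDocument(doc);
    }
    indexWriter.commit(); // flush the changes, but do NOT close the writer here
}

// Close the single writer only when the application shuts down.
public void shutdownIndex() throws IOException {
    indexWriter.close();
}

With a single long-lived writer (and a single reused IndexSearcher), the per-request open/close churn goes away, which is usually what drives the open-file count up in setups like this.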
from the docs:
Note that you can open an index with create=true even while readers are using the index. The old readers will continue to search the "point in time" snapshot they had opened, and won't see the newly created index until they re-open.
Also reuse the IndexSearcher. I cannot see the code for searching, but IndexSearcher is thread-safe and can be used read-only as well.
I also suggest setting the merge factor on the writer. This is not necessary, but it will help limit the creation of inverted index files; tune it by trial and error.
I think we'd need to see your search code to be sure, but I'd suspect that it is a problem with the index searcher. More specifically, make sure that your index reader is being properly closed when you've finished with it.
Good luck,
The scientifically correct answer would be: you can't really tell from this fragment of code.
The more constructive answer would be:
You have to make sure that only one IndexWriter is writing to the index at any given time, and you therefore need some mechanism to ensure that. So my answer depends on what you want to accomplish:
do you want a deeper understanding of Lucene? or..
do you just want to build and use an index?
If your answer is the latter, you probably want to look at projects like Solr, which hide all the index reading and writing.
This question is probably a duplicate of
Too many open files Error on Lucene
I am repeating my answer to that question here.
Use a compound index to reduce the file count. When this flag is set, Lucene will write a segment as a single .cfs file instead of multiple files. This will reduce the number of files significantly.
IndexWriter.setUseCompoundFile(true)
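In Lucene 3.x the compound-file flag can also be set on the merge policy used by IndexWriterConfig; a sketch of that variant, assuming the directory/analyzer setup from the question (LogByteSizeMergePolicy is just one possible policy):

// Configure the writer so each segment is packed into a single compound (.cfs) file.
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_32, analyzer);
LogByteSizeMergePolicy mergePolicy = new LogByteSizeMergePolicy();
mergePolicy.setUseCompoundFile(true); // fewer files per segment, at a small indexing-speed cost
config.setMergePolicy(mergePolicy);
IndexWriter indexWriter = new IndexWriter(directory, config);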
