How to read the same file twice, in different tasks, using an executor service?
I used the sample code structure below, but reading the same file simultaneously gives a 409 error. How can I resolve it?
//Sample code
ExecutorService ee = Executors.newFixedThreadPool(2);

Callable<Object> c1 = () -> {
    // Each task opens its own connection and reads the file independently.
    try (InputStream in = new URL("https://server.com/file1").openStream();
         BufferedReader br = new BufferedReader(new InputStreamReader(in))) {
        br.lines().forEach(System.out::println);
    }
    return null;
};

Callable<Object> c2 = () -> {
    try (InputStream in = new URL("https://server.com/file1").openStream();
         BufferedReader br = new BufferedReader(new InputStreamReader(in))) {
        br.lines().forEach(System.out::println);
    }
    return null;
};

List<Callable<Object>> tasks = new ArrayList<>();
tasks.add(c1);
tasks.add(c2);
ee.invokeAll(tasks);
ee.shutdown();
Each task works fine on its own.
But when both tasks run at the same time, they both try to access the same file and fail with a 409 Conflict error.
How can I resolve it?
Note: I need to fetch the same file simultaneously from different tasks.
I don't want to read the file once, store the content in a list, and then use that list for further processing.
I have a Spring Boot application and I am trying to merge two PDF files: one that I get as a byte array from another service, and one that I have locally in my resources folder at /static/documents/my-file.pdf. This is the code for how I get the byte array from the file in resources:
public static byte[] getMyPdfContentForLocale(final Locale locale) {
    byte[] result = new byte[0];
    try {
        final File myFile = new ClassPathResource(TEMPLATES.get(locale)).getFile();
        final Path filePath = Paths.get(myFile.getPath());
        result = Files.readAllBytes(filePath);
    } catch (IOException e) {
        LOGGER.error(format("Failed to get document for local %s", locale), e);
    }
    return result;
}
I get the file and its byte array. Later I try to merge these two files with the following code:
PDFMergerUtility pdfMergerUtility = new PDFMergerUtility();
pdfMergerUtility.addSource(new ByteArrayInputStream(offerDocument));
pdfMergerUtility.addSource(new ByteArrayInputStream(merkblattDocument));
ByteArrayOutputStream os = new ByteArrayOutputStream();
pdfMergerUtility.setDestinationStream(os);
pdfMergerUtility.mergeDocuments(null);
os.toByteArray();
But unfortunately it throws an error:
throw new IOException("Page tree root must be a dictionary");
I have checked, and it performs this validation before throwing:
if (!(root.getDictionaryObject(COSName.PAGES) instanceof COSDictionary))
{
throw new IOException("Page tree root must be a dictionary");
}
And I really have no idea what this means or how to fix it.
The strangest thing is that I created a totally new project and tried the same code to merge two documents (the same documents), and it works!
Additionally, what I have tried is:
Changing the Spring Boot version to check whether that was the cause
Calling the mergeDocuments method like this: pdfMergerUtility.mergeDocuments(setupMainMemoryOnly())
Calling the mergeDocuments method like this: pdfMergerUtility.mergeDocuments(setupTempFileOnly())
Getting the bytes with a different method, not using Files from java.nio
Executing this in a different thread
Merging only files stored locally (in resources)
Merging the file that I get from the other service - this works, by the way, which is why I am sure that file is fine
Can anyone help with this?
The issue, as Tilman Hausherr said, is the resource filtering that you can find in your pom file. If you are in a situation where you are not allowed to modify it, then this approach will help you:
final String path = new ClassPathResource(TEMPLATES.get(locale)).getFile().getAbsolutePath();
final File file = new File(path);
final Path filePath = Paths.get(file.getPath());
result = Files.readAllBytes(filePath);
and then just pass the bytes to the PDFMergerUtility object (or even the whole file instead of the byte array).
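For completeness, here is a rough sketch of how that can plug into the merge. This is only an illustration, not the exact code from the question: offerDocument is assumed to be the byte array received from the other service, and MemoryUsageSetting.setupMainMemoryOnly() is the option already mentioned above (mergeDocuments(null) works too).
final String path = new ClassPathResource(TEMPLATES.get(locale)).getFile().getAbsolutePath();

PDFMergerUtility pdfMergerUtility = new PDFMergerUtility();
pdfMergerUtility.addSource(new ByteArrayInputStream(offerDocument)); // bytes from the other service
pdfMergerUtility.addSource(new File(path));                          // pass the resource as a whole file

ByteArrayOutputStream os = new ByteArrayOutputStream();
pdfMergerUtility.setDestinationStream(os);
pdfMergerUtility.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());
byte[] mergedPdf = os.toByteArray();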
I have created a JavaFX app to decompile hundreds of class files in my project using the Procyon decompiler. As more and more files are processed, the app's memory usage hits 1 GB within a couple of minutes. I guess this has something to do with string processing and the objects created in the process not being garbage collected?
Here is sample code to reproduce the issue:
File file1 = new File("file1.class");
File file2 = new File("file2.class");
ExecutorService pool = Executors.newFixedThreadPool(4);
DecompileUtils dcUtils = new DecompileUtils();
for (int i = 0; i < 500; i++) {
    Callable<Integer> task = () -> {
        String sourceCode1 = dcUtils.decompile(file1.getAbsolutePath());
        String sourceCode2 = dcUtils.decompile(file2.getAbsolutePath());
        //do something with the result
        return 1;
    };
    pool.submit(task);
}
The class containing the method to decompile the file:
public class DecompileUtils {
    public String decompile(String source) throws IOException {
        final DecompilerSettings settings = DecompilerSettings.javaDefaults();
        PlainTextOutput pText = new PlainTextOutput();
        Decompiler.decompile(source, pText, settings);
        return pText.toString();
    }
}
Edit: As I was going through the Procyon source code, I noticed that the memory usage goes up abruptly when the AstBuilder object is created, i.e. when the buildAst method that resides in JavaLanguage.class is called:
AstBuilder astBuilder = buildAst(type, options);
I am trying to add two small files to a zip, as that is the format the destination requires. Both files are less than 1000 KB, but when I run my code the program hangs indefinitely during zip.close(), with no errors.
What am I doing wrong?
val is = new PipedInputStream()
val os = new PipedOutputStream(is)
val cos = new CountingOutputStream(os)
val zip = new ZipOutputStream(cos)
val fis = new FileInputStream(file)
zip.putNextEntry(new ZipEntry(location))
var i = fis.read()
while (i != -1) {
  zip.write(i)
  i = fis.read()
}
zip.closeEntry()
fis.close()
zip.close()
When using piped streams, you need to read from the PipedInputStream at the same time you're writing to a PipedOutputStream, otherwise the pipe fills up and the writing will block.
Based on your code, you're not doing the reading part (which needs to happen in a separate thread, of course). You can test it with a FileOutputStream instead of the pipe, and it should write the file nicely.
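Here is a minimal Java sketch of that idea, under the question's assumptions (file and location come from the question; the CountingOutputStream is left out for brevity): one thread drains the PipedInputStream into a buffer while the original thread writes to and closes the zip.
PipedInputStream is = new PipedInputStream();
PipedOutputStream os = new PipedOutputStream(is);

// Drain the pipe on a separate thread so the zip writer never blocks.
ExecutorService readerThread = Executors.newSingleThreadExecutor();
Future<byte[]> zipped = readerThread.submit(() -> {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    byte[] chunk = new byte[8192];
    int n;
    while ((n = is.read(chunk)) != -1) {
        buffer.write(chunk, 0, n);
    }
    return buffer.toByteArray();
});

try (ZipOutputStream zip = new ZipOutputStream(os);
     FileInputStream fis = new FileInputStream(file)) {
    zip.putNextEntry(new ZipEntry(location));
    byte[] chunk = new byte[8192];
    int n;
    while ((n = fis.read(chunk)) != -1) {
        zip.write(chunk, 0, n);
    }
    zip.closeEntry();
} // closing the zip also closes the pipe, which ends the reader's loop

byte[] result = zipped.get();
readerThread.shutdown();
The same structure works in Scala; the only essential point is that the read happens concurrently with the write.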
I saw a few discussions on this but couldn't quite understand the right solution:
I want to load a couple hundred files from S3 into an RDD. Here is how I'm doing it now:
ObjectListing objectListing = s3.listObjects(new ListObjectsRequest()
        .withBucketName(...)
        .withPrefix(...));
List<String> keys = new LinkedList<>();
objectListing.getObjectSummaries().forEach(summary -> keys.add(summary.getKey())); // repeat while objectListing.isTruncated()
JavaRDD<String> events = sc.parallelize(keys).flatMap(new ReadFromS3Function(clusterProps));
The ReadFromS3Function does the actual reading using the AmazonS3 client:
public Iterator<String> call(String s) throws Exception {
    AmazonS3 s3Client = getAmazonS3Client(properties);
    S3Object object = s3Client.getObject(new GetObjectRequest(...));
    InputStream is = object.getObjectContent();
    List<String> lines = new LinkedList<>();
    String str;
    try {
        BufferedReader reader = new BufferedReader(new InputStreamReader(is));
        if (is != null) {
            while ((str = reader.readLine()) != null) {
                lines.add(str);
            }
        } else {
            ...
        }
    } finally {
        ...
    }
    return lines.iterator();
}
I kind of "translated" this from answers I saw for the same question in Scala. I think it's also possible to pass the entire list of paths to sc.textFile(...), but I'm not sure which is the best-practice way.
The underlying problem is that listing objects in S3 is really slow, and the way it is made to look like a directory tree kills performance whenever something does a tree walk (as wildcard pattern matching of paths does).
The code in the post does the all-children listing, which delivers much better performance; it's essentially what ships with Hadoop 2.8 as s3a's listFiles(path, recursive) - see HADOOP-13208.
After getting that listing, you've got strings to object paths which you can then map to s3a/s3n paths for Spark to handle as text file inputs, and to which you can then apply work:
val files = keys.map(key => s"s3a://$bucket/$key").mkString(",")
sc.textFile(files).map(...)
And as requested, here's the Java code used:
String prefix = "s3a://" + properties.get("s3.source.bucket") + "/";
List<String> keys = new LinkedList<>();
objectListing.getObjectSummaries().forEach(summary -> keys.add(prefix + summary.getKey()));
// repeat while objectListing is truncated
JavaRDD<String> events = sc.textFile(String.join(",", keys));
Note that I switched s3n to s3a, because, provided you have the hadoop-aws and amazon-sdk JARs on your classpath, the s3a connector is the one you should be using. It's better, and it's the one that gets maintained and tested against Spark workloads by people (me). See The history of Hadoop's S3 connectors.
You may use sc.textFile to read multiple files.
You can pass multiple file URLs as its argument.
You can specify whole directories, use wildcards, and even a comma-separated list of directories and wildcards.
For example:
sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file")
Reference from this answer.
I guess if you parallelize while reading, the AWS calls will run on the executors, which should definitely improve performance:
val bucketName = xxx
val s3 = new AmazonS3Client(new BasicAWSCredentials("awsAccessKeyId", "secretKey"))
val keys = s3.listObjects(request).getObjectSummaries.map(_.getKey).toList
val df = sc.parallelize(keys)
  .flatMap { key => Source.fromInputStream(s3.getObject(bucketName, key).getObjectContent: InputStream).getLines }
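For reference, here is a rough Java sketch of the same idea (bucketName, keys, and the credentials are placeholder assumptions, and it assumes Spark 2.x, where mapPartitions returns an Iterator). Building the S3 client inside mapPartitions means it is created on the executors rather than serialized from the driver.
JavaRDD<String> lines = sc.parallelize(keys)
        .mapPartitions(keyIterator -> {
            // Create the client on the executor; S3 clients are not serializable.
            AmazonS3 s3 = new AmazonS3Client(new BasicAWSCredentials("accessKey", "secretKey"));
            List<String> out = new ArrayList<>();
            while (keyIterator.hasNext()) {
                String key = keyIterator.next();
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(s3.getObject(bucketName, key).getObjectContent()))) {
                    reader.lines().forEach(out::add);
                }
            }
            return out.iterator();
        });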
In a document import method, I work with a large number of files. Each file can be 100 MB - 200 MB in size. I want to process them asynchronously using threads. In a for loop, each file is processed and then indexed (with Lucene). This operation is very costly and too slow to do in real time, and the overall import must not stop.
The general structure of the import method is given below:
public void docImport()
{
    ExecutorService executor = Executors.newFixedThreadPool(5);
    for (final File file : fileList)
    {
        // Do some work...
        executor.execute(new Runnable() {
            @Override
            public void run() {
                zipFile(file);   // Each zipped file has a different name, same directory.
                indexFile(file); // Each file is indexed into the same directory.
            }
        });
    }
    executor.shutdown();
}
The general structure of the indexFile method:
public void indexFile(File file)
{
    ExecutorService executor = Executors.newFixedThreadPool(1);
    IndexWriter writer = null;
    Directory dir = .....;
    Analyzer analyzer = new StandardAnalyzer(LUCENE_VERSION);
    IndexWriterConfig iwc = new IndexWriterConfig(LUCENE_VERSION, analyzer);
    iwc.setRAMBufferSizeMB(200);
    iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
    writer = new IndexWriter(dir, iwc);
    Document lucenedoc = new Document();
    lucenedoc.add(..);
    if (writer.getConfig().getOpenMode() == IndexWriterConfig.OpenMode.CREATE) {
        writer.addDocument(lucenedoc);
    } else {
        writer.updateDocument(new Term(PATH, innerPath), lucenedoc);
    }
    executor.shutdown();
}
My question is:
While the docImport method is running, 5 threads read files and each thread tries to index files into the same Lucene index.
So this error occurs at intervals: "org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@C:\lucene\index\write.lock"
For example, sometimes only 30 of 100 files get indexed. The others are not indexed because of the error.
How can I resolve this error? How can I handle this?
You're getting this error because you attempt to open an IndexWriter when there is already a writer open on the index.
In addition to that issue, opening a new IndexWriter is a very expensive operation. Even if you were to get it working (say, by synchronizing a block which opens, uses and then closes the IndexWriter), it would likely be quite slow.
Instead, open one IndexWriter, keep it open, and share it across each of the threads.
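A minimal sketch of that structure, built on the question's own code (zipFile, fileList, PATH, LUCENE_VERSION and the Directory are the question's placeholders; error handling is reduced to a comment): the writer is created once, shared by the pool (IndexWriter is thread-safe), and closed only after all tasks have finished.
Directory dir = .....; // same directory as in the question
Analyzer analyzer = new StandardAnalyzer(LUCENE_VERSION);
IndexWriterConfig iwc = new IndexWriterConfig(LUCENE_VERSION, analyzer);
iwc.setRAMBufferSizeMB(200);
iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
final IndexWriter writer = new IndexWriter(dir, iwc); // one writer for the whole import

ExecutorService executor = Executors.newFixedThreadPool(5);
for (final File file : fileList) {
    executor.execute(() -> {
        try {
            zipFile(file);
            Document lucenedoc = new Document();
            // lucenedoc.add(..);
            writer.updateDocument(new Term(PATH, file.getPath()), lucenedoc); // stand-in for innerPath
        } catch (IOException e) {
            // log and continue; one bad file should not stop the whole import
        }
    });
}
executor.shutdown();
executor.awaitTermination(1, TimeUnit.HOURS);
writer.close(); // close once, after all indexing tasks have completed
If you prefer to keep indexFile as a separate method, pass the shared writer into it as a parameter instead of creating a new one inside it.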