Liferay Concurrent FileEntry Upload - java

Problem Statement:
In Liferay I have to import a zip file into a folder in the Liferay CMS. So far I had implemented serial unzipping of the zip file, creating its folders and then its files. The problem is that the whole process takes a lot of time, so I wanted to use a parallel approach to creating the folders and files.
My Solution:

1. I used a java.util.concurrent.ExecutorService to create an Executors.newFixedThreadPool(NTHREDS), where NTHREDS is the number of threads to run in parallel (say 5).
2. I read all the folder paths from the zip and placed the list of zip entries (files) in a HashMap, keyed by folder path.
3. I traversed all keys in the map and created the folders serially.
4. I then traversed the lists of zip entries (files) from the map and passed each one to a worker, one file per worker; these workers are submitted to the ExecutorService for execution.

So far I haven't seen any significant reduction in the time of the whole process. Am I moving in the right direction? Does Liferay support concurrent file addition? What am I doing wrong?
I would be very thankful for any help in this regard.
Below is my code:
imports
...
...
public class TestImportZip {

    private static final int NTHREDS = 5;
    ExecutorService executor = null;
    ...
    ...
    ....
    Map<String, Folder> folders = new HashMap<String, Folder>();
    File zipsFile = null;

    public TestImportZip(............, File zipFile, .) {
        .
        .
        this.zipsFile = zipFile;
        this.executor = Executors.newFixedThreadPool(NTHREDS);
    }

    // From here the process starts
    public void importZip() {
        Map<String, List<ZipEntry>> foldersMap = new HashMap<String, List<ZipEntry>>();
        try (ZipFile zipFile = new ZipFile(zipsFile)) {
            zipFile.stream().forEach(entry -> {
                String entryName = entry.getName();
                if (entryName.contains("/")) {
                    String key = entryName.substring(0, entryName.lastIndexOf("/"));
                    List<ZipEntry> zipEntries = foldersMap.get(key);
                    if (zipEntries == null) {
                        zipEntries = new ArrayList<>();
                    }
                    zipEntries.add(entry);
                    foldersMap.put(key, zipEntries);
                }
            });
            createFolders(foldersMap.keySet());
            createFiles(foldersMap);
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

    private void createFolders(Set<String> folderPathSets) {
        // create folder and put the folder in map
        .
        .
        .
        folders.put(folderPath, folder);
    }

    private void createFiles(Map<String, List<ZipEntry>> foldersMap) {
        .
        .
        .
        // Traverse all the files from all the lists in the map and send them to a worker
        createFileWorker(folderPath, zipEntry);
    }

    private void createFileWorker(String folderPath, ZipEntry zipEntry) {
        CreateEntriesWorker cfw = new CreateEntriesWorker(folderPath, zipEntry);
        executor.execute(cfw);
    }

    class CreateEntriesWorker implements Runnable {

        Folder folder = null;
        ZipEntry entryToCreate = null;

        public CreateEntriesWorker(String folderPath, ZipEntry zipEntry) {
            this.entryToCreate = zipEntry;
            // get folder from already created folder map
            this.folder = folders.get(folderPath);
        }

        public void run() {
            if (this.folder != null) {
                long startTime = System.currentTimeMillis();
                try (ZipFile zipFile = new ZipFile(zipsFile)) {
                    InputStream inputStream = zipFile.getInputStream(entryToCreate);
                    try {
                        String name = entryToCreate.getName();
                        // created file entry here
                    } catch (Exception e) {
                    } finally {
                        if (inputStream != null)
                            inputStream.close();
                    }
                } catch (IOException e) {
                    // TODO Auto-generated catch block
                    e.printStackTrace();
                }
            }
        }
    }
}

Your simplified code does not contain any Liferay reference that I recognize. The description you provide hints that you're trying to optimize some code but aren't getting any better performance out of the attempt. This is typically a sign that you're trying to optimize the wrong aspect of the problem (or that it's already quite optimized).
You'll need to determine the actual bottleneck of your operation in order to know whether it's feasible to optimize. There's a common saying that "premature optimization is the root of all evil". What does that mean here?
I'll completely make up numbers here - don't quote me on them: they're freely invented for illustration purposes. Let's say that your operation of adding the contents of a zip file to Liferay's repository spends the following percentages of its resources on:
4% zip file decoding/decompressing
6% file I/O for zip operations and temporary files
10% database operation for storing the files
60% for extracting text only from Word, PDF, Excel and other files stored within the zip file, in order to index the documents in the full-text index
20% overhead of the full-text indexing library for putting together the index.
Suppose you optimize the zip file decoding/decompressing - what overall improvement can you expect?
While my numbers are made up: if your optimizations do not have any measurable result, I'd recommend reverting them, measuring where you actually need to optimize, and going after that place (or accepting it and upgrading your hardware if that place is out of reach).
Run those numbers for CPU, I/O, memory and other potential bottlenecks. Identify your actual bottleneck #1, fix it, measure again. You'll see that bottleneck #2 has gotten a promotion. Rinse and repeat until you're happy.
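If it helps: before adding threads, a first step is simply timing each phase separately. This is only a sketch - the phase methods are made-up names standing in for whatever your import actually does, not Liferay APIs:

// Hypothetical phase methods, used only to illustrate the "measure first" advice.
long t0 = System.nanoTime();
extractZipToTemp(zipsFile);        // phase 1 (made-up name)
long t1 = System.nanoTime();
createRepositoryFolders();         // phase 2 (made-up name)
long t2 = System.nanoTime();
addFileEntries();                  // phase 3 (made-up name) - typically where indexing happens
long t3 = System.nanoTime();
System.out.printf("unzip=%dms, folders=%dms, files+indexing=%dms%n",
        (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, (t3 - t2) / 1_000_000);

Whichever phase dominates is the one worth attacking first.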

Related

Slow operations in parallel

I need help with running parallel operations. The goal of the code is to extract a large number of small files from the same tar into different folders in a very short time.
This is the code:
public void decompress(File archive, File destination) throws RuntimeException {
    try (InputStream in = new FileInputStream(archive);
         BufferedInputStream buff = new BufferedInputStream(in);
         TarArchiveInputStream is = (TarArchiveInputStream) new ArchiveStreamFactory().createArchiveInputStream("tar", buff)
    ) {
        TarArchiveEntry entry;
        while ((entry = is.getNextTarEntry()) != null) {
            File file = new File(destination, entry.getName());
            file.getParentFile().mkdirs();
            Files.write(file.toPath(), is.readAllBytes());
        }
    } catch (IOException | ArchiveException e) {
        e.printStackTrace();
    }
}
When I execute this operation once, it takes ~900 ms.
But when I do something like the following to execute the same operation multiple times in parallel, it takes 20000 ms:
ExecutorService EXECUTOR_SERVICE = Executors.newFixedThreadPool(20);
File archive = ...;
for (int i = 0; i < 5; i++) {
    File directory = new File("Dir_" + i);
    EXECUTOR_SERVICE.submit(() -> decompress(archive, directory));
}
or
File archive = ...;
for (int i = 0; i < 5; i++) {
    File directory = new File("Dir_" + i);
    new Thread(() -> decompress(archive, directory)).start();
}
One suspicion is that the directories contain many files, so File.mkdirs needlessly performs many checks.
The constructor of BufferedInputStream can take a custom buffer size. That never helped much in my experience, but it might with your disk. With parallelism it could also help to prevent a lot of "disk head movement".
You probably already tried Files.copy, but it might still have better memory behavior than readAllBytes.
So the version becomes (eschewing File in favor of Path):
public void decompress(File archive, File destination) throws RuntimeException {
    final int bufferSize = 1024 * 128;
    Path archivePath = archive.toPath();
    Path destinationPath = destination.toPath();
    try (InputStream in = new FileInputStream(archive);
         BufferedInputStream buff = new BufferedInputStream(in, bufferSize);
         TarArchiveInputStream is = (TarArchiveInputStream)
                 new ArchiveStreamFactory().createArchiveInputStream("tar", buff)
    ) {
        Path oldFileParent = destinationPath;
        Files.createDirectories(oldFileParent);
        TarArchiveEntry entry;
        while ((entry = is.getNextTarEntry()) != null) {
            Path file = destinationPath.resolve(entry.getName());
            Path fileParent = file.getParent();
            if (!fileParent.equals(oldFileParent)) {
                oldFileParent = fileParent;
                Files.createDirectories(oldFileParent);
            }
            Files.copy(is, file);
            //Files.write(file, is.readAllBytes());
        }
    } catch (IOException | ArchiveException e) {
        e.printStackTrace();
    }
}
Throwing a RuntimeException and capturing the IOException/ArchiveException without rethrowing it (e.g. as new IllegalStateException(e)) is a matter of taste.
Now to adding parallelism: disk output is probably the bottleneck. Writing two files to the same disk in parallel means skipping back and forth on the disk. Small files might just about work.
It seems better to parallelize by reading the next file in one thread and then writing it in another thread.
Two threads might theoretically perform better than many threads with heightened disk traffic. readAllBytes might then be appropriate, so that the writing thread does not use is.
Since the tar entry probably also holds the file size, that would allow you to check whether readAllBytes is efficient enough for large files.
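A rough sketch of that two-thread idea, assuming Apache Commons Compress is on the classpath; the class name, queue capacity and buffer size are my own choices, not part of the question:

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelinedUntar {

    // one entry's name and bytes, handed from the reader thread to the writer thread
    private static class Chunk {
        final String name;
        final byte[] data;
        Chunk(String name, byte[] data) { this.name = name; this.data = data; }
    }

    private static final Chunk POISON = new Chunk("", new byte[0]); // end-of-stream marker

    public static void decompress(Path archive, Path destination) throws Exception {
        BlockingQueue<Chunk> queue = new ArrayBlockingQueue<>(64); // bounded, so memory stays limited

        Thread writer = new Thread(() -> {
            try {
                while (true) {
                    Chunk c = queue.take();
                    if (c == POISON) {
                        return; // reader has finished
                    }
                    Path file = destination.resolve(c.name);
                    Files.createDirectories(file.getParent());
                    Files.write(file, c.data);
                }
            } catch (InterruptedException | IOException e) {
                throw new IllegalStateException(e);
            }
        });
        writer.start();

        try (TarArchiveInputStream tar = new TarArchiveInputStream(
                new BufferedInputStream(new FileInputStream(archive.toFile()), 128 * 1024))) {
            TarArchiveEntry entry;
            while ((entry = tar.getNextTarEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // readAllBytes() stops at the end of the current entry
                queue.put(new Chunk(entry.getName(), tar.readAllBytes()));
            }
        } finally {
            queue.put(POISON);
        }
        writer.join();
    }
}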
Logging was mentioned in this question. It is known that logging can consume a lot of time, and with parallelism it becomes even more critical. But you seem to be aware of that: you wrote that you wrote your own logger. For a library, System.Logger is actually best. It is a façade that uses whatever logger the application provides. This would have prevented the logger vulnerability hidden in library dependencies this past year.
Ignoring the fact that you are not decompressing the file in parallel here (you are running multiple threads decompressing the same file concurrently, essentially overwriting the result), there may be several reasons for this performance hit. I/O is one, so it depends on the underlying implementation. Also, what is the Logger you are using there? While other parts of your code don't seem to be shared among multiple threads, the static call to the Logger is something that is shared.
Also note: java.nio uses FileChannels which provide synchronous I/O, so depending on how you create the channels, you may get into similar situations (though I don't believe this applies here).

Java create tar archive with entries of unknown size

I have a web app where I need to be able to serve the user an archive of multiple files. I've set up a generic ArchiveExporter, and made a ZipArchiveExporter. Works beautifully! I can stream my data to my server, and archive the data and stream it to the user all without using much memory, and without needing a filesystem (I'm on Google App Engine).
Then I remembered the whole zip64 thing with 4 GB zip files. My archives can potentially get very large (high-res images), so I'd like to have an option to avoid zip files for my larger input.
I checked out org.apache.commons.compress.archivers.tar.TarArchiveOutputStream and thought I had found what I needed! Sadly, when I checked the docs and ran into some errors, I quickly found out that you MUST pass the size of each entry as you stream. This is a problem because the data is being streamed to me with no way of knowing the size beforehand.
I tried counting and returning the written bytes from export(), but TarArchiveOutputStream expects a size in TarArchiveEntry before writing to it, so that obviously doesn't work.
I can use a ByteArrayOutputStream and read each entry entirely before writing its content so I know its size, but my entries can potentially get very large, and that is not very polite to the other processes running on the instance.
I could use some form of persistence, upload the entry, and query the data size. However, that would be a waste of my google storage api calls, bandwidth, storage, and runtime.
I am aware of this SO question asking almost the same thing, but he settled for using zip files and there is no more relevant information.
What is the ideal solution to creating a tar archive with entries of unknown size?
public abstract class ArchiveExporter<T extends OutputStream> extends Exporter { // base class
    public abstract void export(OutputStream out) throws IOException; // from Exporter interface
    protected abstract void archiveItems(T t) throws IOException;
}

public class ZipArchiveExporter extends ArchiveExporter<ZipOutputStream> { // zip class, works as intended
    @Override
    public void export(OutputStream out) throws IOException {
        try (ZipOutputStream zos = new ZipOutputStream(out, Charsets.UTF_8)) {
            zos.setLevel(0);
            archiveItems(zos);
        }
    }

    @Override
    protected void archiveItems(ZipOutputStream zos) throws IOException {
        zos.putNextEntry(new ZipEntry(exporter.getFileName()));
        exporter.export(zos);
        // chained call to export from another exporter, like a JSON exporter for instance
        zos.closeEntry();
    }
}

public class TarArchiveExporter extends ArchiveExporter<TarArchiveOutputStream> {
    @Override
    public void export(OutputStream out) throws IOException {
        try (TarArchiveOutputStream taos = new TarArchiveOutputStream(out, "UTF-8")) {
            archiveItems(taos);
        }
    }

    @Override
    protected void archiveItems(TarArchiveOutputStream taos) throws IOException {
        TarArchiveEntry entry = new TarArchiveEntry(exporter.getFileName());
        //entry.setSize(?);
        taos.putArchiveEntry(entry);
        exporter.export(taos);
        taos.closeArchiveEntry();
    }
}
EDIT: This is what I was thinking with the ByteArrayOutputStream. It works, but I cannot guarantee I will always have enough memory to store the whole entry at once, hence my streaming efforts. There has to be a more elegant way of streaming a tarball! Maybe this is a question more suited for Code Review?
protected void byteArrayOutputStreamApproach(TarArchiveOutputStream taos) throws IOException {
    TarArchiveEntry entry = new TarArchiveEntry(exporter.getFileName());
    try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
        exporter.export(baos);
        byte[] data = baos.toByteArray();
        // holding the ENTIRE entry in memory. What if it's huge? What if it has more than Integer.MAX_VALUE bytes? :[
        int len = data.length;
        entry.setSize(len);
        taos.putArchiveEntry(entry);
        taos.write(data);
        taos.closeArchiveEntry();
    }
}
EDIT: This is what I meant by uploading the entry to a medium (Google Cloud Storage in this case) to accurately query the whole size. It seems like major overkill for what looks like a simple problem, but it doesn't suffer from the same RAM problems as the solution above - just at the cost of bandwidth and time. I hope someone smarter than me comes by and makes me feel stupid soon :D
protected void googleCloudStorageTempFileApproach(TarArchiveOutputStream taos) throws IOException {
    TarArchiveEntry entry = new TarArchiveEntry(exporter.getFileName());
    String name = NameHelper.getRandomName(); // get random name for temp storage
    BlobInfo blobInfo = BlobInfo.newBuilder(StorageHelper.OUTPUT_BUCKET, name).build(); // prepare upload of temp file
    WritableByteChannel wbc = ApiContainer.storage.writer(blobInfo); // get WriteChannel for temp file
    try (OutputStream out = Channels.newOutputStream(wbc)) {
        exporter.export(out); // stream items to remote temp file
    } finally {
        wbc.close();
    }
    Blob blob = ApiContainer.storage.get(blobInfo.getBlobId());
    long size = blob.getSize(); // accurately query the size after upload
    entry.setSize(size);
    taos.putArchiveEntry(entry);
    ReadableByteChannel rbc = blob.reader(); // get ReadChannel for temp file
    try (InputStream in = Channels.newInputStream(rbc)) {
        IOUtils.copy(in, taos); // stream back to local tar stream from remote temp file
    } finally {
        rbc.close();
    }
    blob.delete(); // delete remote temp file
    taos.closeArchiveEntry();
}
I've been looking at a similar issue, and as far as I can tell this is a constraint of the tar file format.
Tar files are written as a stream, and the metadata (filenames, permissions etc.) is written between the file data (i.e. metadata 1, file data 1, metadata 2, file data 2, etc.). The program that extracts the data reads metadata 1, then starts extracting file data 1, but it has to have a way of knowing when it's done. This could be done in a number of ways; tar does it by having the length in the metadata.
Depending on your needs, and what the recipient expects, there are a few options that I can see (not all of them apply to your situation):
As you mentioned, load an entire file, work out the length, then send it.
Divide the file into blocks, of predefined length (which fits into memory), then tar them up as file1-part1, file1-part2 etc.; the last block would be short.
Divide the file into blocks of a predefined length (which don't need to fit into memory), then pad the last block to that size with something appropriate.
Work out the maximum possible size of the file, and pad to that size.
Use a different archive format.
Make your own archive format, which does not have this limitation.
Interestingly, gzip does not have predefined limits, and multiple gzip streams can be concatenated together, each with its own "original filename". Unfortunately, standard gunzip extracts all the resulting data into one file, using (I believe) the first filename.
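For what it's worth, a minimal sketch of the second option above (fixed-size chunks, each emitted as its own entry so the size is always known up front), assuming Commons Compress; the block size, naming scheme and readFully helper are illustrative choices only:

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;

import java.io.IOException;
import java.io.InputStream;

public final class ChunkedTarWriter {

    private static final int BLOCK_SIZE = 8 * 1024 * 1024; // arbitrary: 8 MB per part

    /** Writes 'source' as baseName-part0001, baseName-part0002, ... entries. */
    public static void writeChunked(String baseName, InputStream source,
                                    TarArchiveOutputStream taos) throws IOException {
        byte[] buffer = new byte[BLOCK_SIZE];
        int part = 1;
        int filled;
        while ((filled = readFully(source, buffer)) > 0) {
            TarArchiveEntry entry =
                    new TarArchiveEntry(String.format("%s-part%04d", baseName, part++));
            entry.setSize(filled); // the size is known: it is just this block
            taos.putArchiveEntry(entry);
            taos.write(buffer, 0, filled);
            taos.closeArchiveEntry();
        }
    }

    /** Fills the buffer as far as possible; returns the number of bytes read (0 at EOF). */
    private static int readFully(InputStream in, byte[] buffer) throws IOException {
        int total = 0;
        while (total < buffer.length) {
            int n = in.read(buffer, total, buffer.length - total);
            if (n < 0) break;
            total += n;
        }
        return total;
    }
}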

Non-blocking file cache (BitmapLruCache) implementation?

I am trying to create a simple demo of the ImageLoader functionality in the Android Volley framework. Its constructor is the following:
public ImageLoader(RequestQueue queue, ImageCache imageCache)
The problem is with the ImageCache. Its JavaDoc states:
Simple cache adapter interface. If provided to the ImageLoader, it
will be used as an L1 cache before dispatch to Volley. Implementations
must not block. Implementation with an LruCache is recommended.
What exactly does 'Implementations must not block' mean in this context?
Is there an example of a non-blocking file cache (even non-Android, but "pure" Java) which I can use to educate myself on how to convert my existing file cache to be non-blocking?
If no such example exists - what may be the negative implications of using my existing implementation, which is (just the reading from the file):
public byte[] get(String filename) {
    byte[] ret = null;
    if (filesCache.containsKey(filename)) {
        FileInfo fi = filesCache.get(filename);
        BufferedInputStream input;
        String path = cacheDir + "/" + fi.getStorageFilename();
        try {
            File file = new File(path);
            if (file.exists()) {
                input = new BufferedInputStream(new FileInputStream(file));
                ret = IOUtils.toByteArray(input);
                input.close();
            } else {
                KhandroidLog.e("Cannot find file " + path);
            }
        } catch (FileNotFoundException e) {
            filesCache.remove(filename);
            KhandroidLog.e("Cannot find file: " + path);
        } catch (IOException e) {
            KhandroidLog.e(e.getMessage());
        }
    }
    return ret;
}
What exactly does 'Implementations must not block' mean in this context?
In your case, you cannot do disk I/O.
This is a Level One (L1) cache, meaning it is designed to return in a matter of microseconds, not milliseconds or seconds. That's why they advocate LruCache, which is a memory cache.
Is there an example of a non-blocking file cache (even non-Android, but "pure" Java) which I can use to educate myself on how to convert my existing file cache to be non-blocking?
An L1 cache should not be a file cache.
what may be the negative implications of using my existing implementation which is (just the reading from the file)
An L1 cache should not be a file cache.
Volley already has an integrated L2 file cache, named DiskBasedCache, used for caching HTTP responses. You can substitute your own implementation of Cache for DiskBasedCache if you wish, and supply that when you create your RequestQueue.
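For reference, the kind of L1 cache the JavaDoc hints at usually looks roughly like this - an in-memory LruCache that implements ImageLoader.ImageCache, with no disk access on get or put. The 1/8-of-heap budget is just a common default here, not a requirement:

import android.graphics.Bitmap;
import android.util.LruCache;
import com.android.volley.toolbox.ImageLoader;

public class BitmapLruCache extends LruCache<String, Bitmap>
        implements ImageLoader.ImageCache {

    /** Default budget: roughly 1/8th of the app's available heap, measured in KB. */
    public BitmapLruCache() {
        this((int) (Runtime.getRuntime().maxMemory() / 1024 / 8));
    }

    public BitmapLruCache(int maxSizeKb) {
        super(maxSizeKb);
    }

    @Override
    protected int sizeOf(String key, Bitmap value) {
        // measure entries in kilobytes, matching maxSizeKb above
        return value.getByteCount() / 1024;
    }

    @Override
    public Bitmap getBitmap(String url) {
        return get(url); // in-memory lookup only, never touches the disk
    }

    @Override
    public void putBitmap(String url, Bitmap bitmap) {
        put(url, bitmap);
    }
}

It can then be passed to the constructor from the question: new ImageLoader(queue, new BitmapLruCache()).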

how to intentionally corrupt a file in java

Note: Please do not judge this question. To those who think I am doing this to "cheat": you are mistaken, as I am no longer in school anyway. In addition, if I were actually trying to cheat, I would simply use services that have already been created for this instead of recreating the program. I took on this project because I thought it might be fun, nothing else. Before you down-vote, please consider the value of the question itself and not its speculative uses, as the purpose of SO is not to judge but simply to give the public information.
I am developing a program in Java that is supposed to intentionally corrupt a file (specifically a .doc, .txt, or .pdf, but others would be good as well).
I initially tried this:
public void corruptFile(String pathInName, String pathOutName) {
    curroptMethod method = new curroptMethod();
    ArrayList<Integer> corruptHash = corrupt(getBytes(pathInName));
    writeBytes(corruptHash, pathOutName);
    new MimetypesFileTypeMap().getContentType(new File(pathInName));
    // "/home/ephraim/Desktop/testfile"
}

public ArrayList<Integer> getBytes(String filePath) {
    ArrayList<Integer> fileBytes = new ArrayList<Integer>();
    try {
        FileInputStream myInputStream = new FileInputStream(new File(filePath));
        do {
            int currentByte = myInputStream.read();
            if (currentByte == -1) {
                System.out.println("broke loop");
                break;
            }
            fileBytes.add(currentByte);
        } while (true);
    } catch (FileNotFoundException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    System.out.println(fileBytes);
    return fileBytes;
}

public void writeBytes(ArrayList<Integer> hash, String pathName) {
    try {
        OutputStream myOutputStream = new FileOutputStream(new File(pathName));
        for (int currentHash : hash) {
            myOutputStream.write(currentHash);
        }
    } catch (FileNotFoundException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    //System.out.println(hash);
}

public ArrayList<Integer> corrupt(ArrayList<Integer> hash) {
    ArrayList<Integer> corruptHash = new ArrayList<Integer>();
    ArrayList<Integer> keywordCodeArray = new ArrayList<Integer>();
    Integer keywordIndex = 0;
    String keyword = "corruptthisfile";
    for (int i = 0; i < keyword.length(); i++) {
        keywordCodeArray.add(keyword.codePointAt(i));
    }
    for (Integer currentByte : hash) {
        //Integer currentByteProduct = (keywordCodeArray.get(keywordIndex) + currentByte) / 2;
        Integer currentByteProduct = currentByte - keywordCodeArray.get(keywordIndex);
        if (currentByteProduct < 0) currentByteProduct += 255;
        corruptHash.add(currentByteProduct);
        if (keywordIndex == (keyword.length() - 1)) {
            keywordIndex = 0;
        } else keywordIndex++;
    }
    //System.out.println(corruptHash);
    return corruptHash;
}
But the problem is that the file can still be opened. When you open it, all of the words are changed (they may not make any sense, and may not even be letters, but the file can still be opened).
So here is my actual question:
Is there a way to make a file so corrupt that the computer doesn't know how to open it at all (ie. when you open it, the computer will say something along the lines of "this file is not recognized, and cannot be opened")?
I think you want to look into RandomAccessFile. Also, it is almost always the case that a program recognizes its files by their very start. So open the file and scramble the first 5 bytes.
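A minimal sketch of that suggestion - overwrite the first few bytes in place, which destroys the magic number/header most formats are recognized by (the byte count and the use of SecureRandom are arbitrary choices):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.security.SecureRandom;

public class HeaderScrambler {

    /** Overwrites the first 'count' bytes of the file with random data. */
    public static void scrambleHeader(String path, int count) throws IOException {
        byte[] junk = new byte[count];
        new SecureRandom().nextBytes(junk);
        try (RandomAccessFile raf = new RandomAccessFile(path, "rw")) {
            raf.seek(0);     // the "magic number" lives at the very start of most formats
            raf.write(junk); // overwrite it in place
        }
    }
}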
The only way to fully corrupt an arbitrary file is to replace all of its contents with random garbage. Even then, there is a vanishingly small probability that the random garbage will actually be something meaningful.
Depending on the file type, it may be possible to recover from limited - or even from not so limited - corruption. E.g.:
Streaming media codecs are designed with network packet loss taken into account. Limited corruption may show up as picture artifacts, or even as a few lost frames, but the content is usually still viewable.
Block-based compression algorithms, such as bzip2, allow undamaged blocks to be recovered.
File-based compression systems such as rar and zip may be able to recover those files whose compressed data has not been damaged, regardless of damage to the rest of the archive.
Human-readable text, such as text files and source code files, is still viewable in a text editor, even if parts of it are corrupt - not to mention its size that does not change. Unless you corrupted the whole thing, any casual reader would be able to tell whether an assignment was done and whether the retransmitted file was the same as the one that got corrupted.
Apart from the ethical issue, have you considered that this would be a one-time thing only? Data corruption does happen, but it's not that frequent and it's never that convenient...
If you are that desperate for more time, you would be better off breaking your leg and getting yourself admitted to a hospital.
There are better ways:
Your professor accepts Word documents. Infect it with a macro virus before sending.
"Forget" to attach the file to the email.
Forge the send date on your email. If your prof is the kind that accepts Word docs, this may work.

Most efficient way to create a large file (> 1GB)

I would like to know the most efficient way to create a very large dummy file in Java.
The file size should be just above 1 GB. It will be used to unit test a method which only accepts files <= 1 GB.
Create a sparse file. That is, open a file, seek to a position above 1GB and write some bytes.
Relevant: Create file with given size in Java
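A sketch of the sparse-file idea in plain Java; setLength is one way to do it, and seeking just past the target size and writing a single byte works as well:

import java.io.IOException;
import java.io.RandomAccessFile;

public class BigDummyFile {

    /** Creates a file that reports the given size; on most filesystems it is stored sparsely. */
    public static void createSparse(String path, long size) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "rw")) {
            raf.setLength(size); // or: raf.seek(size - 1); raf.write(0);
        }
    }

    public static void main(String[] args) throws IOException {
        createSparse("dummy.bin", (1L << 30) + 1); // just over 1 GB
    }
}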
Can't you make a mock which returns a file size of > 1 GB? File I/O doesn't sound very unit-testy to me (although that depends on what your idea of a unit test is).
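If the method under test only looks at File.length(), even a throwaway anonymous subclass avoids touching the disk at all (a sketch, not tied to any mocking framework):

// A fake File whose length() claims to be just over 1 GB; nothing is written to disk.
File oneGbPlus = new File("does-not-exist") {
    @Override
    public long length() {
        return (1L << 30) + 1;
    }
};
// pass oneGbPlus to the method under test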
I made this function to create sparse files:
private boolean createSparseFile(String filePath, Long fileSize) {
    boolean success = true;
    String command = "dd if=/dev/zero of=%s bs=1 count=1 seek=%s";
    String formattedCommand = String.format(command, filePath, fileSize);
    Process p;
    try {
        p = Runtime.getRuntime().exec(formattedCommand);
        p.waitFor();
        p.destroy();
    } catch (IOException | InterruptedException e) {
        fail(e.getLocalizedMessage());
    }
    return success;
}
