Non-blocking file cache (BitmapLruCache) implementation? - java

I am trying to create a simple demo for the ImageLoader functionality of the Android Volley framework. The constructor is the following:
public ImageLoader(RequestQueue queue, ImageCache imageCache)
The problem is with the ImageCache. Its JavaDoc states:
Simple cache adapter interface. If provided to the ImageLoader, it
will be used as an L1 cache before dispatch to Volley. Implementations
must not block. Implementation with an LruCache is recommended.
What exactly does 'Implementations must not block' mean in this context?
Is there an example of a non-blocking file cache (even non-Android, "pure" Java) that I can use to educate myself on how to convert my existing file cache to be non-blocking?
If no such example exists, what may be the negative implications of using my existing implementation, which is (just the reading from the file):
public byte[] get(String filename) {
byte[] ret = null;
if (filesCache.containsKey(filename)) {
FileInfo fi = filesCache.get(filename);
BufferedInputStream input;
String path = cacheDir + "/" + fi.getStorageFilename();
try {
File file = new File(path);
if (file.exists()) {
input = new BufferedInputStream(new FileInputStream(file));
ret = IOUtils.toByteArray(input);
input.close();
} else {
KhandroidLog.e("Cannot find file " + path);
}
} catch (FileNotFoundException e) {
filesCache.remove(filename);
KhandroidLog.e("Cannot find file: " + path);
} catch (IOException e) {
KhandroidLog.e(e.getMessage());
}
}
return ret;
}

What exactly does 'Implementations must not block' mean in this context?
In your case, you cannot do disk I/O.
This is a Level One (L1) cache, meaning it is designed to return in a matter of microseconds, not milliseconds or seconds. That's why they advocate LruCache, which is a memory cache.
Is there an example of a non-blocking file cache (even non-Android, "pure" Java) that I can use to educate myself on how to convert my existing file cache to be non-blocking?
An L1 cache should not be a file cache.
what may be the negative implications of using my existing implementation which is (just the reading from the file)
An L1 cache should not be a file cache.
Volley already has an integrated L2 file cache, named DiskBasedCache, used for caching HTTP responses. You can substitute your own implementation of Cache for DiskBasedCache if you wish, and supply that when you create your RequestQueue.
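To illustrate what "must not block" implies, here is a minimal sketch of a bounded in-memory LRU cache in plain Java. The class name MemoryLruCache and its entry-count limit are assumptions made for this example. On Android you would more likely wrap android.util.LruCache and expose it through Volley's ImageCache methods (getBitmap and putBitmap), but the principle is the same: every lookup touches only the heap, never the disk.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical bounded in-memory cache: lookups never perform disk I/O. */
class MemoryLruCache<K, V> {
    private final int maxEntries;
    private final LinkedHashMap<K, V> map;

    MemoryLruCache(int maxEntries) {
        this.maxEntries = maxEntries;
        // accessOrder = true keeps the least recently used entry first
        this.map = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                // Evict the least recently used entry once the limit is exceeded
                return size() > MemoryLruCache.this.maxEntries;
            }
        };
    }

    /** Returns the cached value or null; also marks the entry as recently used. */
    synchronized V get(K key) {
        return map.get(key);
    }

    synchronized void put(K key, V value) {
        map.put(key, value);
    }

    synchronized int size() {
        return map.size();
    }
}
```

A miss simply returns null, which lets the caller fall through to the slower layers (the disk cache, then the network) on a background thread.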

Related

Slow operations in parallel

I need help with running parallel operations. The goal of the code is to extract a large number of small files from the same tar archive into different folders in a very short time.
This is the code:
public void decompress(File archive, File destination) throws RuntimeException {
try (InputStream in = new FileInputStream(archive);
BufferedInputStream buff = new BufferedInputStream(in);
TarArchiveInputStream is = (TarArchiveInputStream) new ArchiveStreamFactory().createArchiveInputStream("tar", buff)
) {
TarArchiveEntry entry;
while ((entry = is.getNextTarEntry()) != null) {
File file = new File(destination, entry.getName());
file.getParentFile().mkdirs();
Files.write(file.toPath(), is.readAllBytes());
}
} catch (IOException | ArchiveException e) {
e.printStackTrace();
}
}
When I execute this operation once, it takes ~900 ms.
But when I do something like this to execute the same operation multiple times in parallel, it takes 20000 ms:
ExecutorService EXECUTOR_SERVICE = Executors.newFixedThreadPool(20);
File archive = ...;
for (int i = 0; i < 5; i++) {
File directory = new File("Dir_" + i);
EXECUTOR_SERVICE.submit(() -> decompress(archive, directory));
}
or
File archive = ...;
for (int i = 0; i < 5; i++) {
File directory = new File("Dir_" + i);
new Thread(() -> decompress(archive, directory)).start();
}
One suspicion is that the directories contain many files, so File.mkdirs performs needlessly many checks.
The constructor of BufferedInputStream accepts a custom buffer size. That has never helped me much, but it might with your disk. With parallelism it could also help to prevent excessive "disk head movements."
You have probably already tried Files.copy; still, it might have better memory behavior than readAllBytes.
So the version becomes (eschewing File in favor of Path):
public void decompress(File archive, File destination) throws RuntimeException {
    final int bufferSize = 1024 * 128;
    Path destinationPath = destination.toPath();
    try (InputStream in = new FileInputStream(archive);
         BufferedInputStream buff = new BufferedInputStream(in, bufferSize);
         TarArchiveInputStream is = (TarArchiveInputStream)
                 new ArchiveStreamFactory().createArchiveInputStream("tar", buff)
    ) {
        // Remember the last directory created, so that createDirectories is
        // called only when the parent directory actually changes.
        Path oldFileParent = destinationPath;
        Files.createDirectories(oldFileParent);
        TarArchiveEntry entry;
        while ((entry = is.getNextTarEntry()) != null) {
            Path file = destinationPath.resolve(entry.getName());
            Path fileParent = file.getParent();
            if (!fileParent.equals(oldFileParent)) {
                oldFileParent = fileParent;
                Files.createDirectories(oldFileParent);
            }
            Files.copy(is, file);
            //Files.write(file, is.readAllBytes());
        }
    } catch (IOException | ArchiveException e) {
        e.printStackTrace();
    }
}
Declaring throws RuntimeException while catching the IOException/ArchiveException without rethrowing it (for example as new IllegalStateException(e)) is a matter of taste.
Now to adding parallelism: disk output is probably the bottleneck. Writing two files to the same disk in parallel means skipping back and forth on the disk, although small files might just work.
It seems better to parallelize by reading the next file in one thread and writing it in another.
Two threads might theoretically perform better than many threads with heightened disk traffic. readAllBytes might then be appropriate, so that the writing thread does not need to use is.
The tar entry probably stores the file size too, which would allow checking whether readAllBytes is efficient enough for large files.
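The reader/writer split described above could be sketched roughly as follows. This is an illustration under assumptions, not a measured improvement: the ReadWritePipeline and Item names are hypothetical, and the source is any Iterable of already-read entries (in the tar case, one Item per entry filled via readAllBytes()). A bounded queue keeps the reading thread from running far ahead of the single writer thread.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical pipeline: the calling thread reads, one extra thread writes. */
class ReadWritePipeline {
    /** One extracted file: relative name plus its already-read content. */
    static final class Item {
        final String name;
        final byte[] data;
        Item(String name, byte[] data) { this.name = name; this.data = data; }
    }

    private static final Item POISON = new Item("", new byte[0]); // end-of-input marker

    static void writeAll(Iterable<Item> source, Path destination)
            throws IOException, InterruptedException {
        BlockingQueue<Item> queue = new ArrayBlockingQueue<>(64);
        AtomicReference<IOException> failure = new AtomicReference<>();
        Thread writer = new Thread(() -> {
            try {
                for (Item item = queue.take(); item != POISON; item = queue.take()) {
                    if (failure.get() != null) {
                        continue; // keep draining so the producer never blocks forever
                    }
                    try {
                        Path file = destination.resolve(item.name);
                        Files.createDirectories(file.getParent());
                        Files.write(file, item.data);
                    } catch (IOException e) {
                        failure.set(e);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();
        for (Item item : source) {
            queue.put(item); // blocks while the writer is 64 items behind
        }
        queue.put(POISON);
        writer.join();
        if (failure.get() != null) {
            throw failure.get(); // report the first write error to the caller
        }
    }
}
```

Whether this beats the single-threaded version depends on the disk, so it is worth timing both before committing to the extra complexity.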
Logging was mentioned in this question. It is well known that logging can consume a lot of time, and with parallelism that becomes even more critical. But you seem to be aware of it: you wrote that you have written your own logger. For a library, System.Logger is actually the best choice. It is a façade that uses whatever logger the application provides. This would have prevented the logger vulnerability hidden in library dependencies over the past year.
Ignoring the fact that you are not decompressing the file in parallel here (you are running multiple threads decompressing the same file concurrently, essentially overwriting the result), there may be several reasons for this performance hit. I/O is one, so it depends on the underlying implementation. Also, what is the Logger you are using there? While other parts of your code don't seem to be shared among multiple threads, the static call to Logger is something that is shared.
Also note: java.nio uses FileChannels which provide synchronous I/O, so depending on how you create the channels, you may get into similar situations (though I don't believe this applies here).

Read/Write Bytes to and From a File Using Only Java.IO

How can we write a byte array to a file (and read it back from that file) in Java?
Yes, we all know there are already lots of questions like that, but they get very messy and subjective due to the fact that there are so many ways to accomplish this task.
So let's reduce the scope of the question:
Domain:
Android / Java
What we want:
Fast (as possible)
Bug-free (in a rigidly meticulous way)
What we are not doing:
Third-party libraries
Any libraries that require Android API later than 23 (Marshmallow)
(So, that rules out Apache Commons, Google Guava, Java.nio, and leaves us with good ol' Java.io)
What we need:
Byte array is always exactly the same (content and size) after going through the write-then-read process
Write method only requires two arguments: File file, and byte[] data
Read method returns a byte[] and only requires one argument: File file
In my particular case, these methods are private (not a library) and are NOT responsible for the following, (but if you want to create a more universal solution that applies to a wider audience, go for it):
Thread-safety (file will not be accessed by more than one process at once)
File being null
File pointing to non-existent location
Lack of permissions at the file location
Byte array being too large
Byte array being null
Dealing with any "index," "length," or "append" arguments/capabilities
So... we're sort of in search of the definitive bullet-proof code that people in the future can assume is safe to use because your answer has lots of up-votes and there are no comments that say, "That might crash if..."
This is what I have so far:
Write Bytes To File:
private void writeBytesToFile(final File file, final byte[] data) {
try {
FileOutputStream fos = new FileOutputStream(file);
fos.write(data);
fos.close();
} catch (Exception e) {
Log.i("XXX", "BUG: " + e);
}
}
Read Bytes From File:
private byte[] readBytesFromFile(final File file) {
RandomAccessFile raf;
byte[] bytesToReturn = new byte[(int) file.length()];
try {
raf = new RandomAccessFile(file, "r");
raf.readFully(bytesToReturn);
} catch (Exception e) {
Log.i("XXX", "BUG: " + e);
}
return bytesToReturn;
}
From what I've read, the possible Exceptions are:
FileNotFoundException : Am I correct that this should not happen as long as the file path being supplied was derived using Android's own internal tools and/or if the app was tested properly?
IOException : I don't really know what could cause this... but I'm assuming that there's no way around it if it does.
So with that in mind... can these methods be improved or replaced, and if so, with what?
It looks like these are going to be core utility/library methods which must run on Android API 23 or later.
Concerning library methods, I find it best to make no assumptions on how applications will use these methods. In some cases the applications may want to receive checked IOExceptions (because data from a file must exist for the application to work), in other cases the applications may not even care if data is not available (because data from a file is only cache that is also available from a primary source).
When it comes to I/O operations, there is never a guarantee that operations will succeed (e.g. user dropping phone in the toilet). The library should reflect that and give the application a choice on how to handle errors.
To optimize I/O performance, always assume the "happy path" and catch errors to figure out what went wrong. This is counterintuitive compared to normal programming but essential when dealing with storage I/O. For example, just checking whether a file exists before reading from it can make your application twice as slow; all these kinds of I/O actions add up quickly and slow your application down. Just assume the file exists and, only if you get an error, check whether the file exists.
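As a small sketch of that idea (the HappyPathRead class and firstByte method are names invented for this illustration), the code below opens the file without any prior exists() check and spends extra I/O on diagnosis only after a failure:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

/** Hypothetical helper showing "try first, diagnose only on failure". */
class HappyPathRead {
    /** Returns the first byte of the file, or -1 if the file is empty. */
    static int firstByte(File f) throws IOException {
        // Happy path: no f.exists() call up front, just open and read.
        try (FileInputStream in = new FileInputStream(f)) {
            return in.read();
        } catch (IOException e) {
            // Failure path: only now pay for an additional file system check.
            if (!f.exists()) {
                throw new IOException("File does not exist: " + f, e);
            }
            throw e;
        }
    }
}
```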
So given those ideas, the main functions could look like:
public static void writeFile(File f, byte[] data) throws FileNotFoundException, IOException {
try (FileOutputStream out = new FileOutputStream(f)) {
out.write(data);
}
}
public static int readFile(File f, byte[] data) throws FileNotFoundException, IOException {
try (FileInputStream in = new FileInputStream(f)) {
return in.read(data);
}
}
Notes about the implementation:
The methods can also throw runtime-exceptions like NullPointerExceptions - these methods are never going to be "bug free".
I do not think buffering is needed/wanted in the methods above since only one native call is done
(see also here).
The application now also has the option to read only the beginning of a file.
To make it easier for an application to read a file, an additional method can be added. But note that it is up to the library to detect any errors and report them to the application since the application itself can no longer detect those errors.
public static byte[] readFile(File f) throws FileNotFoundException, IOException {
int fsize = verifyFileSize(f);
byte[] data = new byte[fsize];
int read = readFile(f, data);
verifyAllDataRead(f, data, read);
return data;
}
private static int verifyFileSize(File f) throws IOException {
long fsize = f.length();
if (fsize > Integer.MAX_VALUE) {
throw new IOException("File size (" + fsize + " bytes) for " + f.getName() + " too large.");
}
return (int) fsize;
}
public static void verifyAllDataRead(File f, byte[] data, int read) throws IOException {
if (read != data.length) {
throw new IOException("Expected to read " + data.length
+ " bytes from file " + f.getName() + " but got only " + read + " bytes from file.");
}
}
This implementation adds another hidden point of failure: OutOfMemory at the point where the new data array is created.
To accommodate applications further, additional methods can be added to help with different scenarios. For example, let's say the application really does not want to deal with checked exceptions:
public static void writeFileData(File f, byte[] data) {
try {
writeFile(f, data);
} catch (Exception e) {
fileExceptionToRuntime(e);
}
}
public static byte[] readFileData(File f) {
try {
return readFile(f);
} catch (Exception e) {
fileExceptionToRuntime(e);
}
return null;
}
public static int readFileData(File f, byte[] data) {
try {
return readFile(f, data);
} catch (Exception e) {
fileExceptionToRuntime(e);
}
return -1;
}
private static void fileExceptionToRuntime(Exception e) {
if (e instanceof RuntimeException) { // e.g. NullPointerException
throw (RuntimeException)e;
}
RuntimeException re = new RuntimeException(e.toString());
re.setStackTrace(e.getStackTrace());
throw re;
}
The method fileExceptionToRuntime is a minimal implementation, but it shows the idea here.
The library could also help an application to troubleshoot when an error does occur. For example, a method canReadFile(File f) could check if a file exists and is readable and is not too large. The application could call such a function after a file-read fails and check for common reasons why a file cannot be read. The same can be done for writing to a file.
Although you can't use third party libraries, you can still read their code and learn from their experience. In Google Guava for example, you usually read a file into bytes like this:
FileInputStream reader = new FileInputStream("test.txt");
byte[] result = ByteStreams.toByteArray(reader);
The core implementation of this is toByteArrayInternal. Before calling this, you should check:
A non-null file is passed (NullPointerException)
The file exists (FileNotFoundException)
After that, it is reduced to handling an InputStream, and this is where IOExceptions come from. When reading streams, a lot of things outside the control of your application can go wrong (bad sectors and other hardware issues, malfunctioning drivers, OS access rights) and manifest themselves as an IOException.
I am copying here the implementation:
private static final int BUFFER_SIZE = 8192;
/** Max array length on JVM. */
private static final int MAX_ARRAY_LEN = Integer.MAX_VALUE - 8;
private static byte[] toByteArrayInternal(InputStream in, Queue<byte[]> bufs, int totalLen)
throws IOException {
// Starting with an 8k buffer, double the size of each successive buffer. Buffers are retained
// in a deque so that there's no copying between buffers while reading and so all of the bytes
// in each new allocated buffer are available for reading from the stream.
for (int bufSize = BUFFER_SIZE;
totalLen < MAX_ARRAY_LEN;
bufSize = IntMath.saturatedMultiply(bufSize, 2)) {
byte[] buf = new byte[Math.min(bufSize, MAX_ARRAY_LEN - totalLen)];
bufs.add(buf);
int off = 0;
while (off < buf.length) {
// always OK to fill buf; its size plus the rest of bufs is never more than MAX_ARRAY_LEN
int r = in.read(buf, off, buf.length - off);
if (r == -1) {
return combineBuffers(bufs, totalLen);
}
off += r;
totalLen += r;
}
}
// read MAX_ARRAY_LEN bytes without seeing end of stream
if (in.read() == -1) {
// oh, there's the end of the stream
return combineBuffers(bufs, MAX_ARRAY_LEN);
} else {
throw new OutOfMemoryError("input is too large to fit in a byte array");
}
}
As you can see most of the logic has to do with reading the file in chunks. This is to handle situations, where you don't know the size of the InputStream, before starting reading. In your case, you only need to read files and you should be able to know the length beforehand, so this complexity could be avoided.
The other check is for OutOfMemoryError. In standard Java the limit is very large, but on Android it will be a much smaller value. You should check, before trying to read the file, that there is enough memory available.
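Because the file length is known up front, the chunked-buffer logic can be replaced by a single readFully call. The following is a minimal sketch using only java.io; the FileBytes class name is an assumption for this example, and the size limit mirrors the MAX_ARRAY_LEN constant quoted above. It does not check available heap memory, so an OutOfMemoryError is still possible for large files.

```java
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

/** Hypothetical java.io-only helpers for a write-then-read round trip. */
class FileBytes {
    static void write(File file, byte[] data) throws IOException {
        try (FileOutputStream out = new FileOutputStream(file)) {
            out.write(data);
        }
    }

    static byte[] read(File file) throws IOException {
        long length = file.length();
        if (length > Integer.MAX_VALUE - 8) {
            throw new IOException("File too large for a byte array: " + length + " bytes");
        }
        byte[] data = new byte[(int) length];
        try (DataInputStream in = new DataInputStream(new FileInputStream(file))) {
            // readFully loops until the array is full and throws EOFException
            // if the file turns out to be shorter than expected.
            in.readFully(data);
        }
        return data;
    }
}
```

Both methods propagate IOException, leaving it to the caller to decide whether a failure is fatal or can be ignored.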

Liferay Concurrent FileEntry Upload

Problem Statement :
In Liferay I have to import a zip file into a folder in the Liferay CMS. So far I have implemented serial unzipping of the zip file, creating its folders and then its files. The problem is that the whole process takes a lot of time, so I had to use a parallel approach for creating the folders and files.
My Solution :
I used a java.util.concurrent.ExecutorService, created with Executors.newFixedThreadPool(NTHREDS), where NTHREDS is the number of threads to run in parallel (say 5).
I read all the folder paths from the zip and stored the list of zip entries (files) for each folder in a HashMap, keyed by folder path.
I traversed all keys in the map and created the folders serially.
I then traversed the lists of zip entries (files) in the map and passed each file to its own worker; these workers are then submitted to the ExecutorService for execution.
So far I have not seen any significant reduction in the time of the whole process. Am I moving in the right direction? Does Liferay support concurrent file addition? What am I doing wrong?
I would be very thankful for any help in this regard.
Below is my code:
imports
...
...
public class TestImportZip {
private static final int NTHREDS = 5;
ExecutorService executor = null;
...
...
....
Map<String,Folder> folders = new HashMap<String,Folder>();
File zipsFile = null;
public TestImportZip(............,File zipFile, .){
.
.
this.zipsFile = zipFile;
this.executor = Executors.newFixedThreadPool(NTHREDS);
}
// From here the process starts
public void importZip() {
Map<String,List<ZipEntry>> foldersMap = new HashMap<String, List<ZipEntry>>();
try (ZipFile zipFile = new ZipFile(zipsFile)) {
zipFile.stream().forEach(entry -> {
String entryName = entry.getName();
if(entryName.contains("/")) {
String key = entryName.substring(0, entryName.lastIndexOf("/"));
List<ZipEntry> zipEntries = foldersMap.get(key);
if(zipEntries == null){
zipEntries = new ArrayList<>();
}
zipEntries.add(entry);
foldersMap.put(key,zipEntries);
}
});
createFolders(foldersMap.keySet());
createFiles(foldersMap);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
private void createFolders(Set<String> folderPathSets) {
// create folder and put the folder in map
.
.
.
folders.put(folderPath,folder);
}
private void createFiles(Map<String, List<ZipEntry>> foldersMap) {
.
.
.
//Traverse all the files from all the list in map and send them to worker
createFileWorker(folderPath,zipEntry);
}
private void createFileWorker(String folderPath,ZipEntry zipEntry) {
CreateEntriesWorker cfw = new CreateEntriesWorker(folderPath, zipEntry);
executor.execute(cfw);
}
class CreateEntriesWorker implements Runnable{
Folder folder = null;
ZipEntry entryToCreate = null;
public CreateEntriesWorker(String folderPath, ZipEntry zipEntry){
this.entryToCreate = zipEntry;
// get folder from already created folder map
this.folder = folders.get(folderPath);
}
public void run() {
if(this.folder != null) {
long startTime = System.currentTimeMillis();
try (ZipFile zipFile = new ZipFile(zipsFile)) {
InputStream inputStream = zipFile.getInputStream(entryToCreate);
try{
String name = entryToCreate.getName();
// created file entry here
}catch(Exception e){
}finally{
if(inputStream != null)
inputStream.close();
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
}
}
Your simplified code does not contain any Liferay reference that I recognize. Your description suggests that you are trying to optimize some code but are not getting any better performance out of your attempt. This is typically a sign that you are trying to optimize the wrong aspect of the problem (or that it is already quite optimized).
You'll need to determine the actual bottleneck of your operation in order to know if it's feasible to optimize. There's a common saying that "premature optimization is the root of all evil". What does it mean?
I'll completely make up numbers here - don't quote me on them: They're freely invented for illustration purposes. Let's say, that your operation of adding the contents of a Zip file to Liferay's repository is distributed to the following percentages of operational resources:
4% zip file decoding/decompressing
6% file I/O for zip operations and temporary files
10% database operation for storing the files
60% for extracting text-only from word, pdf, excel and other files stored within the zip file in order to index the document in the full-text index
20% overhead of the full-text indexing library for putting together the index.
Suppose you're optimizing the zip file decoding/decompressing - what overall improvement of numbers can you expect?
While my numbers are made up: if your optimizations do not have any effect, I'd recommend reverting them, measuring where you need to optimize, and going after that place (or accepting it and upgrading your hardware if that place is out of reach).
Run those numbers for CPU, I/O, memory and other potential bottlenecks. Identify your actual bottleneck #1, fix it, and measure again. You'll see that bottleneck #2 has gotten a promotion. Rinse and repeat until you're happy.
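The question of what overall improvement to expect can be answered with Amdahl's law. The sketch below plugs in the invented percentages from the list above; the OverallSpeedup class and the assumed 5x local speedup are illustrative only.

```java
/** Amdahl's law applied to the made-up percentages above. */
class OverallSpeedup {
    /**
     * Overall speedup when a part that takes `fraction` of the total time
     * is made `factor` times faster and everything else stays the same.
     */
    static double overall(double fraction, double factor) {
        return 1.0 / ((1.0 - fraction) + fraction / factor);
    }

    public static void main(String[] args) {
        // Zip decoding (4% of the time) made 5 times faster
        System.out.printf("zip decoding: %.3f%n", overall(0.04, 5));
        // Text extraction (60% of the time) made 5 times faster
        System.out.printf("text extraction: %.3f%n", overall(0.60, 5));
    }
}
```

With these numbers, making the 4% part five times faster yields an overall speedup of only about 1.03x, while the same effort spent on the 60% part would yield about 1.9x. Even an infinitely fast zip decoder could not save more than 4% of the total time.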

How to download a large file from Google Cloud Storage using Java with checksum control

I want to download large files from Google Cloud Storage using the google provided Java library com.google.cloud.storage. I have working code, but I still have one question and one major concern:
My major concern is: when is the file content actually downloaded? During (references to the code below) storage.get(blobId), during blob.reader() or during reader.read(bytes)? This becomes very important when handling an invalid checksum: what do I need to do to actually trigger fetching the file over the network again?
The simpler question is: Is there built in functionality to do md5 (or crc32c) check on the received file in the google library? Maybe I don't need to implement it on my own.
Here is my method trying to download big files from Google Cloud Storage:
private static final int MAX_NUMBER_OF_TRIES = 3;
public Path downloadFile(String storageFileName, String bucketName) throws IOException {
// In my real code, this is a field populated in the constructor.
Storage storage = Objects.requireNonNull(StorageOptions.getDefaultInstance().getService());
BlobId blobId = BlobId.of(bucketName, storageFileName);
Path outputFile = Paths.get(storageFileName.replaceAll("/", "-"));
int retryCounter = 1;
Blob blob;
boolean checksumOk;
MessageDigest messageDigest;
try {
messageDigest = MessageDigest.getInstance("MD5");
} catch (NoSuchAlgorithmException ex) {
throw new RuntimeException(ex);
}
do {
LOGGER.debug("Start download file {} from bucket {} to Content Store (try {})", storageFileName, bucketName, retryCounter);
blob = storage.get(blobId);
if (null == blob) {
throw new CloudStorageCommunicationException("Failed to download file after " + retryCounter + " tries.");
}
if (Files.exists(outputFile)) {
Files.delete(outputFile);
}
try (ReadChannel reader = blob.reader();
FileChannel channel = new FileOutputStream(outputFile.toFile(), true).getChannel()) {
ByteBuffer bytes = ByteBuffer.allocate(128 * 1024);
int bytesRead = reader.read(bytes);
while (bytesRead > 0) {
bytes.flip();
messageDigest.update(bytes.array(), 0, bytesRead);
channel.write(bytes);
bytes.clear();
bytesRead = reader.read(bytes);
}
}
String checksum = Base64.encodeBase64String(messageDigest.digest());
checksumOk = checksum.equals(blob.getMd5());
if (!checksumOk) {
Files.delete(outputFile);
messageDigest.reset();
}
} while (++retryCounter <= MAX_NUMBER_OF_TRIES && !checksumOk);
if (!checksumOk) {
throw new CloudStorageCommunicationException("Failed to download file after " + MAX_NUMBER_OF_TRIES + " tries.");
}
return outputFile;
}
The google-cloud-java storage library does not validate checksums on its own when reading data beyond normal HTTPS/TCP correctness checking. If it compared the MD5 of the received data to the known MD5, it would need to download the entire file before it could return any results from read(), which for very large files would be infeasible.
What you're doing is a good idea if you need the additional protection of comparing MD5s. If this is a one-off task, you could use the gsutil command-line tool, which does this same sort of additional check.
As the JavaDoc of ReadChannel says:
Implementations of this class may buffer data internally to reduce remote calls.
So the implementation you get from blob.reader() could cache the whole file, some bytes or nothing and just fetch byte for byte when you call read(). You will never know and you shouldn't care.
As only read() throws an IOException and the other methods you used do not, I'd say that only calling read() will actually download stuff. You can also see this in the sources of the lib.
Btw., despite the example in the JavaDocs of the library, you should check for >= 0, not > 0. A return value of 0 just means nothing was read, not that the end of the stream has been reached. End of stream is signaled by returning -1.
For retrying after a failed checksum check, get a new reader from the blob. If anything caches the downloaded data, it is the reader itself. So if you get a new reader from the blob, the file will be downloaded from the remote again.
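The read loop with the corrected end-of-stream check could look like the sketch below. It is written against the standard ReadableByteChannel interface rather than the Google library, on the assumption that the ReadChannel returned by blob.reader() can be passed in as one; the ChannelCopy class name is hypothetical.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;
import java.security.MessageDigest;

/** Hypothetical helper: copy a channel to another while computing a digest. */
class ChannelCopy {
    /** Copies until end of stream and returns the number of bytes written. */
    static long copy(ReadableByteChannel source, WritableByteChannel target,
                     MessageDigest digest) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(128 * 1024);
        long total = 0;
        // A result of 0 only means "nothing read this time"; -1 means end of stream.
        while (source.read(buffer) != -1) {
            buffer.flip();
            // duplicate() shares the content but has its own position, so the
            // digest update does not consume the bytes needed for the write below.
            digest.update(buffer.duplicate());
            while (buffer.hasRemaining()) {
                total += target.write(buffer);
            }
            buffer.clear();
        }
        return total;
    }
}
```

For a retry, the caller would reset the digest, delete the partial output, and call copy again with a fresh reader obtained from the blob.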

Disabling Multipart Caching in CXF jax-rs

I posted this question to the CXF list, without any luck. So here we go. I am trying to upload large files to a remote server (think of them as virtual machine disks). So I have a RESTful service that accepts upload requests. The handler for the upload looks like:
@POST
@Consumes(MediaType.MULTIPART_FORM_DATA)
@Path("/doupload")
public Response receiveStream(MultipartBody multipart) {
List<Attachment> allAttachments = multipart.getAllAttachments();
Attachment att = null;
for (Attachment b : allAttachments) {
if (UPLOAD_FILE_DESCRIPTOR.equals(b.getContentId())) {
att = b;
}
}
Assert.notNull(att);
DataHandler dh = att.getDataHandler();
if (dh == null) {
throw new WebApplicationException(HTTP_BAD_REQUEST);
}
try {
InputStream is = dh.getInputStream();
byte[] buf = new byte[65536];
int n;
OutputStream os = getOutputStream();
while ((n = is.read(buf)) > 0) {
os.write(buf, 0, n);
}
ResponseBuilder rb = Response.status(HTTP_CREATED);
return rb.build();
} catch (IOException e) {
log.error("Got exception=", e);
throw new WebApplicationException(HTTP_INTERNAL_ERROR);
} catch (NoSuchAlgorithmException e) {
log.error("Got exception=", e);
throw new WebApplicationException(HTTP_INTERNAL_ERROR);
} finally {}
}
The client for this code is fairly simple:
public void sendLargeFile(String filename) {
WebClient wc = WebClient.create(targetUrl);
InputStream is = new FileInputStream(new File(filename));
Response r = wc.post(new Attachment(Constants.UPLOAD_FILE_DESCRIPTOR,
MediaType.APPLICATION_OCTET_STREAM, is));
}
The code works fine in terms of functionality. In terms of performance, I noticed that before my handler (receiveStream() method) gets the first byte out of the stream, the whole stream actually gets persisted into a temporary file (using a CachedOutputStream). Unfortunately, this is not acceptable for my purposes.
My handler simply passes the incoming bytes to a backend storage system (virtual machine disk repository), and waiting for the whole disk to be written to a cache only to be read again takes a lot of time, tying up a lot of resources, and reducing throughput.
There is a cost associated with writing the blocks and reading them again, since the app is running in the cloud, and the cloud provider charges per block read/written.
Since every byte is written to the local disk, my service VM must have enough disk space to accommodate the total sizes of all the streams being uploaded (i.e., if I have 10 uploads of 100GB each, I must have 1TB of disk just to cache the content). That again is extra money, as the size of the service VM grows dramatically, and the cloud provider charges for the provisioned disk size as well.
Given all of this, I am looking for a way to use the HTTP InputStream (or as close to it as possible) to read the attachment directly from there and handle it afterwards. I guess the question translates into one of:
- Is there a way to tell CXF not do caching
- OR - is there a way to pass CXF an output stream (one I write) to use, rather than using CachedOutputStream
I found a similar question here. The resolution says use CXF 2.2.3 or later, I am using 2.4.4 (and tried with 2.7.0) with no luck.
Thanks.
I think it's logically not possible (neither in CXF nor anywhere else). You're calling getAllAttachments(), which means that the server has to collect information about them from the HTTP input stream. It means that the entire stream has to go into memory for MIME parsing.
In your case you should work directly with the stream, and do the MIME parsing yourself:
public Response receiveStream(InputStream input) {
Now you have full control of the input and can consume it into memory byte-by-byte.
I ended up fixing the problem in an inelegant way, but it works, so I wanted to share my experience. Please do let me know if there are some "standard" or better ways.
Since I am writing the server side, I knew I was accessing all the attachments in the order they were sent and processing them as they are streamed in. So, to reflect that behavior of the handler method (the receiveStream() method above), I created a new annotation on the server side called "@SequentialAttachmentProcessing" and annotated my method above with it.
I also wrote a subclass of Attachment, called SequentialAttachment, that acts like a linked list. It has a skip() method that skips over the current attachment, and when an attachment ends, the hasMore() method tells you whether there is another one.
Then I wrote a custom multipart/form-data provider which behaves as follows: if the target method is annotated as above, handle the attachment; otherwise call the default provider to do the handling. When it is handled by my provider, it always returns at most one attachment. Hence it could be misleading to an unsuspecting handler method. However, I think it is acceptable, since the writer of the server must have annotated the method as "@SequentialAttachmentProcessing" and therefore must know what that entails.
As a result the implementation of the receiveStream() method is now something like:
@POST
@SequentialAttachmentProcessing
@Consumes(MediaType.MULTIPART_FORM_DATA)
@Path("/doupload")
public Response receiveStream(MultipartBody multipart) throws IOException {
    List<Attachment> allAttachments = multipart.getAllAttachments();
    Assert.isTrue(allAttachments.size() <= 1);
    if (allAttachments.size() > 0) {
        Attachment head = allAttachments.get(0);
        Assert.isTrue(head instanceof SequentialAttachment);
        SequentialAttachment att = (SequentialAttachment) head;
        while (att != null) {
            DataHandler dh = att.getDataHandler();
            InputStream is = dh.getInputStream();
            byte[] buf = new byte[65536];
            int n;
            OutputStream os = getOutputStream();
            while ((n = is.read(buf)) > 0) {
                os.write(buf, 0, n);
            }
            // Advance to the next attachment, or end the loop when none is left.
            att = att.hasMore() ? att.next() : null;
        }
    }
    return Response.status(HTTP_CREATED).build();
}
While this solved my immediate problem, I still believe there has to be a standard way of doing this. I hope this helps someone.
