I create PDF docs in memory as OutputStreams. These should be uploaded to S3. My problem is that it's not possible to create a PutObjectRequest from an OutputStream directly (according to this thread in the AWS dev forum). I use aws-java-sdk-s3 v1.10.8 in a Dropwizard app.
The two workarounds I can see so far are:
Copy the OutputStream to an InputStream and accept that twice the amount of RAM is used.
Pipe the OutputStream to an InputStream and accept the overhead of an extra thread (see this answer)
If I don't find a better solution I'll go with #1, because it looks as if I can afford the extra memory more easily than extra threads/CPU in my setup.
Is there any other, possibly more efficient way to achieve this that I have overlooked so far?
Edit:
My OutputStreams are ByteArrayOutputStreams
I solved this by subclassing ByteArrayOutputStream as a ConvertibleOutputStream:
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;

public class ConvertibleOutputStream extends ByteArrayOutputStream {
    // Creates an InputStream without actually copying the buffer and using up memory for that.
    public InputStream toInputStream() {
        return new ByteArrayInputStream(buf, 0, count);
    }
}
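For illustration, a minimal upload sketch using this class with the v1 SDK (the PdfUploader wrapper, bucket name, and key are placeholders, not part of my actual app):

import java.io.InputStream;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;

public class PdfUploader {
    public void upload(AmazonS3 s3, ConvertibleOutputStream pdf) {
        ObjectMetadata metadata = new ObjectMetadata();
        // Setting the length up front lets the SDK skip buffering the stream to compute it.
        metadata.setContentLength(pdf.size());
        metadata.setContentType("application/pdf");
        InputStream in = pdf.toInputStream(); // no copy of the underlying buffer
        s3.putObject(new PutObjectRequest("my-bucket", "docs/report.pdf", in, metadata));
    }
}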
What's the actual type of your OutputStream? Since it's an abstract class, there's no saying where the data actually goes (or if it even goes anywhere).
But let's assume that you're talking about a ByteArrayOutputStream, since it at least keeps the data in memory (unlike many others).
If you create a ByteArrayInputStream out of its buffer, there's no duplicated memory. That's the whole idea of streaming.
Another workaround is to use the presigned URL feature of S3.
Since a presigned URL allows you to upload files to S3 with HTTP PUT or POST, you can send your output stream to an HttpURLConnection.
Amazon provides sample code for this approach.
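A rough sketch of that approach with the v1 SDK (assuming an existing AmazonS3 client; the bucket, key, and 15-minute expiry are placeholders):

import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Date;
import com.amazonaws.HttpMethod;
import com.amazonaws.services.s3.AmazonS3;

public class PresignedUpload {
    public int upload(AmazonS3 s3, ByteArrayOutputStream pdf) throws Exception {
        Date expiration = new Date(System.currentTimeMillis() + 15 * 60 * 1000);
        URL url = s3.generatePresignedUrl("my-bucket", "docs/report.pdf",
                expiration, HttpMethod.PUT);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("PUT");
        try (OutputStream out = conn.getOutputStream()) {
            pdf.writeTo(out); // stream the in-memory buffer straight to S3
        }
        return conn.getResponseCode(); // 200 on success
    }
}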
Related
I'm trying to learn how to use DeflaterOutputStream as something to kill time during my winter break. I'm confused because when I look at the documentation https://docs.oracle.com/javase/7/docs/api/java/util/zip/DeflaterOutputStream.html, it says that deflate() is used to write compressed data to the OutputStream, while write() writes data to the DeflaterOutputStream (the compressing OutputStream) to be compressed.
However, I'm looking at sample code on the internet, and none of it uses deflate() at all. All the code I've seen so far just calls write() on the DeflaterOutputStream without calling deflate().
https://stackoverflow.com/a/13060441/12181863
https://www.programcreek.com/java-api-examples/?api=java.util.zip.DeflaterOutputStream
I noticed that the code wraps a FileOutputStream in the DeflaterOutputStream, but how do the two interact? Does it automatically call deflate() to send compressed data to the FileOutputStream when data is written to the DeflaterOutputStream?
It's protected: it is intended for anything subclassing that stream. You're not subclassing it, so as far as you are concerned it is an implementation detail that you cannot include in your reasoning and that isn't meant for you to invoke.
Unless, of course, you subclass it.
Which you could - it's sort of a toolkit for building LZ-based compression streams on top of. That's why both GZIPOutputStream and ZipOutputStream extend it: those are different containers that more or less use the same compression technology, and they do invoke that deflate(). Unless you're developing your own LZ-based compression system or implementing a reader for an existing non-zip, non-gz, non-deflate-based compression format, this method is not meant for you.
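For instance, a sketch of such a subclass; CountingDeflaterStream is invented here purely to show the hook, and its override mirrors what the default deflate() does (drain the Deflater into the wrapped stream):

import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.DeflaterOutputStream;

class CountingDeflaterStream extends DeflaterOutputStream {
    long compressedBytes;

    CountingDeflaterStream(OutputStream out) {
        super(out);
    }

    @Override
    protected void deflate() throws IOException {
        // def and buf are protected fields inherited from DeflaterOutputStream.
        int len = def.deflate(buf, 0, buf.length);
        if (len > 0) {
            out.write(buf, 0, len);
            compressedBytes += len; // track output size, e.g. for a container header
        }
    }
}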
These kinds of output streams are called 'filter streams': they do not themselves represent any resource, they wrap around one. They can wrap around any OutputStream (or any InputStream - the concept works on 'both sides', so to speak) and modify bytes in transit.
var out = new DeflaterOutputStream(whatever) creates a new deflater stream that will compress any data you send to it (via out.write(stuff)) and will in turn take the compressed data and send it on to whatever. It does the job of:
taking bytes (as per out.write) and buffering as much as is needed to compress this data, and
then processing the compressed data, as it becomes available, by sending it to the wrapped output stream (whatever, in this example) via its write method.
The basic usage is:
Create a resource, such as Files.newOutputStream, someSocket.getOutputStream(), httpServletResponse.getOutputStream(), System.out, or anything else that produces a stream - it's an abstract concept for a reason: to make things flexible.
Wrap that resource into a DeflaterOutputStream
Write all your data to the DeflaterOutputStream. Forget about the original - you made it so you can pass it to the DeflaterOutputStream, and that's where your interaction with the underlying stream ends.
Close the deflater stream (which will end up closing the underlying stream as well).
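A minimal sketch of those four steps (file name and payload are placeholders):

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.DeflaterOutputStream;

public class DeflateDemo {
    public static void main(String[] args) throws IOException {
        // 1. Create the underlying resource.
        OutputStream file = Files.newOutputStream(Paths.get("data.deflated"));
        // 2. Wrap it. 3. Write only to the wrapper. 4. Close the wrapper,
        //    which flushes the remaining compressed bytes and closes the file.
        try (DeflaterOutputStream out = new DeflaterOutputStream(file)) {
            out.write("hello hello hello".getBytes());
        }
    }
}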
I have code like
ParquetWriter<Record> writer = getParquetWriter("s3a://my_bucket/my_object_path.snappy.parquet");
for (Record r : someIterable) {
    validate(r);
    writer.write(r);
}
writer.close();
if validate throws an exception, I want to release all resources associated with the writer. But I don't want to create any objects in S3 in that case. Is this achievable?
If I close the writer it will conclude the s3 multipart upload and create an object in the cloud. If I don't close it, the parts written so far will remain in the disk buffer, clogging up the works.
Yes, it is a problem. It's been discussed in HADOOP-16906, "Add some Abortable.abort() interface for streams etc which can be terminated".
The problem here is that it's not enough to add this to the S3ABlockOutputStream class; we'd need to pass it through FSDataOutputStream etc., specify it in the FS APIs, define the semantics if the passthrough doesn't work, commit to maintaining it, and so on. A lot of effort. If you do want to do that though, patches are welcome...
Keep an eye on HDFS-13934, the multipart upload API. This will let you do the upload and then commit/abort it, but it doesn't quite fit your workflow.
I'm afraid you will have to go with the upload. Do remember to set a lifecycle rule for the bucket to delete old uploads, and look at the hadoop s3guard uploads command to list/abort them too.
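If you want to do the cleanup from code rather than the CLI, a sketch with the v1 AWS SDK (the class name and cutoff policy are illustrative; pagination of the listing is omitted for brevity):

import java.util.Date;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.AbortMultipartUploadRequest;
import com.amazonaws.services.s3.model.ListMultipartUploadsRequest;
import com.amazonaws.services.s3.model.MultipartUpload;

public class StaleUploadCleaner {
    // Aborts every multipart upload in the bucket started before the cutoff.
    public void abortOlderThan(AmazonS3 s3, String bucket, Date cutoff) {
        for (MultipartUpload u : s3.listMultipartUploads(
                new ListMultipartUploadsRequest(bucket)).getMultipartUploads()) {
            if (u.getInitiated().before(cutoff)) {
                s3.abortMultipartUpload(new AbortMultipartUploadRequest(
                        bucket, u.getKey(), u.getUploadId()));
            }
        }
    }
}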
I am using Spring-MVC and I need to send a MP4 file back to the user. The MP4 files are, of course, very large in size (> 2 GB).
I found this SO thread Downloading a file from spring controllers, which shows how to stream back a binary file, which should theoretically work for my case. However, what I am concerned about is efficiency.
In one case, an answer suggests loading all the bytes into memory:
byte[] data = SomeFileUtil.loadBytes(new File("somefile.mp4"));
In another case, an answer suggests using IOUtils:
InputStream is = new FileInputStream(new File("somefile.mp4"));
OutputStream os = response.getOutputStream();
IOUtils.copy(is, os);
I wonder if either of these are more memory efficient than simply defining a resource mapping?
<resources mapping="/videos/**" location="/path/to/videos/"/>
The resource mapping may work, except that I need to protect all requests to these videos, and I do not think resource mapping will lend itself to logic that protects the content.
Is there another way to stream back binary data (namely, MP4)? I'd like something that's memory efficient.
I would think that defining a resource mapping would be the cleanest way of handling it. With regard to protecting access, you can simply add /videos/** to your security configuration and define what access you allow for it via something like
<security:intercept-url pattern="/videos/**" access="ROLE_USER, ROLE_ADMIN"/>
or whatever access you desire.
Also, you might consider saving these large MP4s to cloud storage and/or a CDN such as Amazon S3 (with or without CloudFront).
Then you can generate unique URLs which last as long as you want them to. The download is then handled by Amazon rather than consuming the computing power, disk space, and memory of your web server to serve up the large resource files. Also, if you use something like CloudFront, you can configure it for streaming rather than download.
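For example, a time-limited download URL with the v1 SDK (class name, bucket, key, and the one-hour expiry are placeholders):

import java.net.URL;
import java.util.Date;
import com.amazonaws.HttpMethod;
import com.amazonaws.services.s3.AmazonS3;

public class VideoLinks {
    // Returns a URL that lets the caller download the video for one hour,
    // with S3 serving the bytes instead of the application server.
    public URL downloadUrl(AmazonS3 s3, String bucket, String key) {
        Date expiration = new Date(System.currentTimeMillis() + 60 * 60 * 1000);
        return s3.generatePresignedUrl(bucket, key, expiration, HttpMethod.GET);
    }
}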
Loading the entire file into memory is worse: as well as using more memory, it doesn't scale, and you don't transmit any data until you've loaded it all, which adds all that latency.
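To make the contrast concrete, here is a minimal controller sketch that streams rather than loads (the class name, path, and access check are placeholders; setContentLengthLong needs Servlet 3.1+):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import javax.servlet.http.HttpServletResponse;
import org.apache.commons.io.IOUtils;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;

@Controller
public class VideoController {
    private static final Path VIDEO_DIR = Paths.get("/path/to/videos");

    // Streams the file in chunks; only one buffer's worth is in memory at a time.
    @RequestMapping("/videos/{name:.+}")
    public void video(@PathVariable String name, HttpServletResponse response) throws Exception {
        // access/authorization checks would go here
        Path file = VIDEO_DIR.resolve(name).normalize(); // normalize() guards against path traversal
        response.setContentType("video/mp4");
        response.setContentLengthLong(Files.size(file));
        try (InputStream in = Files.newInputStream(file)) {
            IOUtils.copy(in, response.getOutputStream());
        }
    }
}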
I have a piece of code that retrieves a file either from a remote server or from the local disk.
I understand that URLConnection can handle both cases, so I was wondering if there was any performance advantage if I used FileInputStream to read the local file rather than just handing it off to URLConnection to read from disk?
No, there's no performance advantage to using FileInputStream over a URLConnection (well, unless you're counting the milliseconds of a handful of extra method calls).
Reading a file via a file:// URL eventually gets you a FileURLConnection (note that this is not part of the official Java library spec, just the Sun-based JREs). If you look at the code, you'll see that it's creating a FileInputStream to work with the file on disk. So other than walking a few layers further down in the stack, the code ends up exactly the same.
The reason why you'd want to use a FileInputStream directly is for clarity of your code. Turning a file path into a URL is a little ugly, and it'd be confusing to do it if you were only ever going to work with files.
In your case, where you need to work with URLs some of the time, it's quite convenient that you can use a file URL and only work with URLs. I imagine you've abstracted nearly all of the interesting logic to work on URLs and can do the ugly business of constructing a file or non-file URL elsewhere.
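To make the equivalence concrete, a small sketch showing both paths ending in a readable stream (data.txt is a placeholder):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class TwoWays {
    public static void main(String[] args) throws IOException {
        // Direct: clearer if you only ever deal with local files.
        try (InputStream direct = new FileInputStream("data.txt")) {
            System.out.println(direct.read());
        }
        // Via URL: internally resolves to a FileURLConnection,
        // which opens a FileInputStream on the same file.
        try (InputStream viaUrl = new File("data.txt").toURI().toURL().openStream()) {
            System.out.println(viaUrl.read());
        }
    }
}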
A FileInputStream obtains input bytes from a file in a file system. FileInputStream is meant for reading streams of raw bytes such as image data.
FileReader is meant for reading streams of characters.
In general, creating a connection to a URL is a multistep process:
The connection object is created by invoking the openConnection method on a URL.
The setup parameters and general request properties are manipulated.
The actual connection to the remote object is made, using the connect method.
The remote object becomes available. The header fields and the contents of the remote object can be accessed.
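Those steps map directly onto code; a minimal sketch against a placeholder URL:

import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public class UrlSteps {
    public static void main(String[] args) throws Exception {
        // 1. Create the connection object.
        URLConnection conn = new URL("https://example.com/data.txt").openConnection();
        // 2. Manipulate setup parameters and request properties.
        conn.setRequestProperty("Accept", "text/plain");
        // 3. Make the actual connection.
        conn.connect();
        // 4. Access the header fields and contents of the remote object.
        System.out.println(conn.getContentType());
        try (InputStream in = conn.getInputStream()) {
            System.out.println(in.read());
        }
    }
}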
I think a good rule of thumb is to use the simplest code (object) possible in order to remain the most efficient. Think minimalist!
P.S. Not sure if you're just moving the file or reading its contents.
I'm writing a web application and want the user to be able click a link and get a file download.
I have an interface in a third-party library that I can't alter:
writeFancyData(File file, Data data);
Is there an easy way that I can create a file object that I can pass to this method that when written to will stream to the HTTP response?
Notes:
Obviously I could just write a temporary file, read it back in, and then write it to the output stream of the HTTP response. However, what I'm looking for is a way to avoid the file system I/O, ideally by creating a fake file that, when written to, will instead write to the output stream of the HTTP response.
e.g.
writeFancyData(new OutputStreamBackedFile(response.getOutputStream()), data);
I need to use the writeFancyData method as it writes a file in a very specific format that I can't reproduce.
Assuming writeFancyData is a black box, it's not possible. As a thought experiment, consider an implementation of writeFancyData that did something like this:
public void writeFancyData(File file, Data data) {
    File localFile = new File(file.getPath());
    ...
    // process data from file
    ...
}
Given that the only thing you can return from any extended version of File is the path name, you're just not going to be able to get the data you want into that method. If the signature included some sort of stream you would be in a much better position, but since all you can pass in is a File, this can't be done.
In practice the implementation is probably one of the FileInputStream or FileReader classes, which use the File object really just for the name and then call out to native methods to get a file descriptor and handle the actual I/O.
As dlawrence writes, with a black-box API it is impossible to determine what it is doing with the File.
A non-Java approach is to create a named pipe. You could establish a reader for the pipe in your program, create a File on that path, and pass it to the API.
Before doing anything so fancy, I would recommend analyzing performance and verifying that disk I/O is indeed a bottleneck.
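A sketch of the named-pipe idea on a POSIX system (the pipe path is illustrative, and writeFancyData/Data are the opaque third-party API from the question):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class PipeTrick {
    public static void main(String[] args) throws Exception {
        // Create the FIFO outside the JVM; Java has no API for this.
        new ProcessBuilder("mkfifo", "/tmp/fancy-pipe").inheritIO().start().waitFor();
        File pipe = new File("/tmp/fancy-pipe");

        // Reader thread: drains whatever the library writes into the pipe.
        Thread reader = new Thread(() -> {
            try (InputStream in = new FileInputStream(pipe)) {
                byte[] buf = new byte[8192];
                for (int n; (n = in.read(buf)) != -1; ) {
                    // forward buf[0..n) to response.getOutputStream() here
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        reader.start();

        // writeFancyData(pipe, data); // the opaque third-party call
        reader.join();
    }
}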
Given that API, the best you can do is to give it the File for a file in a RAM disk filesystem.
And lodge a bug / defect report against the API asking for an overload that takes a Writer or OutputStream argument.