Best strategy to upload files with unknown size to S3 - java

I have a server-side application that runs through a large number of image URLs and uploads the images from these URLs to S3.
The files are served over HTTP. I download them using the InputStream I get from an HttpURLConnection via its getInputStream method, and hand that InputStream to the AWS S3 client's putObject method (AWS Java SDK v1) to upload the stream to S3. So far so good.
I am trying to introduce a new external image data source. The problem with this data source is that the HTTP server serving these images does not return a Content-Length HTTP header. This means I cannot tell how many bytes the image will be, which is a number the AWS S3 client requires to validate that the image was correctly uploaded from the stream to S3.
The only ways I can think of to deal with this issue are to either get the server owner to add a Content-Length HTTP header to their response (unlikely), or to download the file into a memory buffer first and then upload it to S3 from there.
These are not big files, but I have many of them.
When considering downloading the file first, I am worried about the memory footprint and concurrency implications (not being able to upload and download chunks of the same file at the same time).
Since I am dealing with many small files, I suspect that the concurrency issues might be "resolved" if I focus on concurrency across multiple files instead of within a single file. So instead of concurrently downloading and uploading chunks of the same file, I will use my I/O effectively by downloading one file while uploading another.
I would love your ideas on how to do this, best practices, pitfalls or any other thought on how to best tackle this issue.
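Since these are small files, the in-memory buffer you describe may well be the simplest workaround. A minimal sketch of that approach (assuming the AWS SDK for Java v1; the bucket and key parameters are placeholders):

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectMetadata;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class BufferedS3Upload {

    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    // Downloads the image fully into memory, then uploads it with a known Content-Length.
    // "bucketName" and "key" are placeholders for your own values.
    public void upload(String imageUrl, String bucketName, String key) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(imageUrl).openConnection();
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (InputStream in = conn.getInputStream()) {
            byte[] chunk = new byte[8192];
            int read;
            while ((read = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
        }
        byte[] bytes = buffer.toByteArray();

        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setContentLength(bytes.length); // now the SDK can validate the upload
        s3.putObject(bucketName, key, new ByteArrayInputStream(bytes), metadata);
    }
}
```

With a bounded thread pool processing one URL per task, memory stays capped at roughly (pool size × largest image), and the pool gives you the across-files concurrency you describe.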

Related

Uploading large (more than 1 GB) files to Amazon S3 using Java: large files temporarily consume a lot of space on the server

I am trying to upload large files (more than 1 GB) to Amazon S3 using Java.
I am using AWS S3 multipart upload to upload large files in chunks.
https://docs.aws.amazon.com/AmazonS3/latest/dev/HLuploadFileJava.html
I am also uploading the files in chunks from the frontend.
So the file being uploaded is temporarily stored on the server in chunks, and then uploaded to S3 in chunks.
The problem is that this method puts a huge load on the server, since it temporarily consumes server space. If multiple users try to upload large files at the same time, this becomes an issue.
Is there any way to upload files directly from the user's system to Amazon S3 in chunks, without storing the file on the server temporarily?
If I upload the files directly from the frontend, there is a major risk of the keys getting exposed.
You should upload directly from the client with a presigned URL.
There is plenty of documentation for this:
AWS SDK Presigned URL + Multipart upload
The presigned URLs are useful if you want your user/customer to be able to upload a specific object to your bucket, but you don't require them to have AWS security credentials or permissions.
You might also be interested in limiting the size that a user is able to upload:
Limit Size Of Objects While Uploading To Amazon S3 Using Pre-Signed URL
Think of a presigned URL as a temporary credential for the client to access a specific S3 location. These credentials expire after a short time, so there is less of a security concern, but do remember to restrict the access granted by the presigned URLs appropriately.
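For reference, a minimal sketch of generating such a presigned PUT URL on the server (assuming the AWS SDK for Java v1; the bucket and key names are placeholders):

```java
import com.amazonaws.HttpMethod;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest;

import java.net.URL;
import java.util.Date;

public class PresignedUrlExample {

    // Generates a short-lived URL that lets the client PUT one specific object
    // without holding any AWS credentials. Bucket and key are placeholders.
    public static URL createUploadUrl(String bucket, String key) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        Date expiration = new Date(System.currentTimeMillis() + 15 * 60 * 1000); // 15 minutes

        GeneratePresignedUrlRequest request =
                new GeneratePresignedUrlRequest(bucket, key)
                        .withMethod(HttpMethod.PUT)
                        .withExpiration(expiration);

        return s3.generatePresignedUrl(request);
    }
}
```

The client then performs a plain HTTP PUT against the returned URL, so the AWS keys never leave the server.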

How to compress files on azure data lake store

I'm using Azure Data Lake Store as a storage service for my Java app. Sometimes I need to compress multiple files; what I do for now is copy all the files to the server, compress them locally, and then send the zip to Azure. Even though this works, it takes a lot of time, so I'm wondering: is there a way to compress files directly on Azure? I checked the data-lake-store SDK, but there's no such functionality.
Unfortunately, at the moment there is no option to do that sort of compression.
There is an open feature request HTTP compression support for Azure Storage Services (via Accept-Encoding/Content-Encoding fields) that discusses uploading compressed files to Azure Storage, but there is no estimation on when this feature might be released.
The only option for you is to implement such a mechanism on your own (using an Azure Function for example).
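If you do implement it yourself, the zip can at least be streamed straight into whatever output stream you write the upload to, instead of being staged as a complete local file first. A minimal sketch of the compression side only (plain java.util.zip; where the target OutputStream comes from is left open, since the Data Lake Store upload call itself is not shown here):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipBeforeUpload {

    // Streams the given local files into a zip written directly to "target".
    // "target" is assumed to be the stream you already use for uploading,
    // so the zip never has to be written to local disk as a whole.
    public static void zipTo(OutputStream target, List<Path> files) throws IOException {
        try (ZipOutputStream zip = new ZipOutputStream(target)) {
            for (Path file : files) {
                zip.putNextEntry(new ZipEntry(file.getFileName().toString()));
                Files.copy(file, zip);   // copy file bytes into the current zip entry
                zip.closeEntry();
            }
        }
    }
}
```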
Hope it helps!

Upload 2GB file with Apache HttpClient

I'm new to Apache HttpClient (HC) and I'm wondering what the best way is to upload files of about 1 or 2 GB in size.
I am using the MinIO SDK to retrieve a presigned URL from the server. After that, I send this presigned URL to the client, which will upload the specified file.
On the MinIO side, the max size per PUT operation is 5 GiB, so there should be no problems there. My main concerns are:
What would be the best way to upload the file from Apache HttpClient in order to get the best performance and the least error-prone behaviour?
I'm guessing that directly uploading a 2 GB file is not a good option. Does Apache HttpClient handle that upload if an error occurs? Is it convenient to upload the file in parts? How do I achieve that?
MinIO server has changed the maximum size to 16 GiB per PutObject. Apache HttpClient should handle uploading up to 2 GiB without issues.
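For the single-PUT case, a minimal sketch with Apache HttpClient 4.x (assuming the presigned URL obtained from MinIO as described above):

```java
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPut;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.FileEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

import java.io.File;
import java.io.IOException;

public class PresignedPutUpload {

    // Streams the file from disk in a single PUT to the presigned URL.
    // FileEntity derives the Content-Length from the file size, so the
    // whole file is never buffered in memory.
    public static int upload(String presignedUrl, File file) throws IOException {
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            HttpPut put = new HttpPut(presignedUrl);
            put.setEntity(new FileEntity(file, ContentType.APPLICATION_OCTET_STREAM));
            try (CloseableHttpResponse response = client.execute(put)) {
                return response.getStatusLine().getStatusCode();
            }
        }
    }
}
```

HttpClient does not resume a failed transfer by itself; retrying after an error means repeating the PUT, or switching to a multipart upload with one presigned URL per part.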

Store a file inside an object

I have a Java client/server desktop application where the communication between client and server is based on sockets, and the messages exchanged between them are serialized objects (message objects that encapsulate requests and responses).
Now I need to make the client able to upload a file from the local computer to the server, but I can't send the file through the buffer, since the buffer is already used for exchanging the message objects.
Should I open another stream to send the file, or is there a better way to upload a file in my situation?
I need to make the client able to upload a file from the local computer to the server
- Open a solely dedicated connection to the server for file uploading.
- Use the File Transfer Protocol to ease your work; moreover, it's quite easy and reliable to use Apache's Commons Net library for file uploading and downloading.
See this link:
http://commons.apache.org/net/
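A minimal sketch of the FTP route with Apache Commons Net (host, credentials and paths here are placeholders):

```java
import org.apache.commons.net.ftp.FTP;
import org.apache.commons.net.ftp.FTPClient;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class FtpUploadExample {

    // Uploads one local file over FTP; returns true if the transfer succeeded.
    public static boolean upload(String host, String user, String password,
                                 String localPath, String remotePath) throws IOException {
        FTPClient ftp = new FTPClient();
        try {
            ftp.connect(host);
            if (!ftp.login(user, password)) {
                return false;
            }
            ftp.setFileType(FTP.BINARY_FILE_TYPE); // avoid corrupting binary files
            ftp.enterLocalPassiveMode();
            try (InputStream in = new FileInputStream(localPath)) {
                return ftp.storeFile(remotePath, in);
            }
        } finally {
            if (ftp.isConnected()) {
                ftp.logout();
                ftp.disconnect();
            }
        }
    }
}
```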
You really only have two options:
Open another connection dedicated to the file upload and send it through that.
Make a message object representing bits of a file being uploaded, and send the file in chunks via these message objects.
The former seems simpler & cleaner to me, requiring less overhead and less complicated code.
You can keep your current solution and pass the file content as an object, for example as a String; use Base64 encoding (or similar) of the content if it contains troublesome characters.
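A minimal sketch of that last suggestion, using a made-up message class that travels over the existing object stream like your other message objects:

```java
import java.io.IOException;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

// Hypothetical message object: wraps the file content as a Base64 string so it
// can be serialized alongside the existing request/response message objects.
public class FileUploadMessage implements Serializable {

    private static final long serialVersionUID = 1L;

    private final String fileName;
    private final String base64Content;

    public FileUploadMessage(String fileName, String base64Content) {
        this.fileName = fileName;
        this.base64Content = base64Content;
    }

    // Reads the whole file into memory, so this suits small-to-medium files.
    public static FileUploadMessage fromFile(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        return new FileUploadMessage(file.getFileName().toString(),
                Base64.getEncoder().encodeToString(bytes));
    }

    public String getFileName() {
        return fileName;
    }

    public byte[] decodeContent() {
        return Base64.getDecoder().decode(base64Content);
    }
}
```

Note that Base64 inflates the payload by about a third; for large files the dedicated-connection approach avoids that overhead.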

java rest streaming upload

We are using REST web services (WS) to upload files whose size is between 10 and 50 MB.
At the moment, we use Java, JAX-RS and CXF for doing it.
The behavior of this stack is to buffer the uploaded files by writing them into a temporary file (because we have large files). This is fine for most users.
Is it possible to stream directly from the socket input?
(not from a whole file in memory nor from a temporary file)
My goal is to have less overhead on I/O and CPU (each file is written twice: once to the buffer and once to the final file). The WS only has to write the files (sometimes several in the same HTTP request) to a path that I calculate from the HTTP query string.
Thanks for your attention
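For the single-file case, JAX-RS will hand you the raw entity as an InputStream, which you can copy straight to disk. A minimal sketch (the base directory and the "name" query parameter are placeholders for however you derive the target path; several files in one request would instead need multipart handling, which CXF may buffer depending on its attachment threshold settings):

```java
import javax.ws.rs.Consumes;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.QueryParam;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

@Path("/upload")
public class StreamingUploadResource {

    // Receives the raw request body as an InputStream and copies it straight
    // to the target path, with no intermediate temporary file in this resource.
    @POST
    @Consumes(MediaType.APPLICATION_OCTET_STREAM)
    public Response upload(@QueryParam("name") String name, InputStream body) throws IOException {
        Files.copy(body, Paths.get("/data/uploads", name)); // placeholder target directory
        return Response.ok().build();
    }
}
```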
