I am profiling my Java distributed crawler (which stores crawled documents in S3), and S3 insertion is definitely a bottleneck. In fact, with a high enough number of threads, the threads consistently get timeout exceptions from S3 because it takes too long for S3 to read the data. Is there a bulk putObject function, provided by either Amazon or another library, that can do this more efficiently?
Example code:
String BUCKET = ...; // S3 bucket name
AmazonS3 client = ...;
InputStream is = ...; // convert the data into input stream
ObjectMetadata meta = ...; // get metadata
String key = ...;
client.putObject(new PutObjectRequest(BUCKET, key, is, meta));
I haven't used S3 with Java, but AWS does support multipart uploads for large files.
http://docs.aws.amazon.com/AmazonS3/latest/dev/uploadobjusingmpu.html
The boto library for Python definitely supports this; I've used it to successfully upload very large database backups.
After looking at the javadocs for the Java library, I think you need to use http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/UploadPartRequest.html instead of the regular request to get a multipart upload going.
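A minimal sketch of what that can look like with the v1 Java SDK (the bucket name, key, local file and the 5 MB part size are placeholders, not from the question):

// uses classes from com.amazonaws.services.s3 and com.amazonaws.services.s3.model
AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
String bucket = "my-bucket";           // placeholder
String key = "crawled/doc-123";        // placeholder
File file = new File("document.html"); // placeholder

// 1. start the multipart upload
InitiateMultipartUploadResult init =
        s3.initiateMultipartUpload(new InitiateMultipartUploadRequest(bucket, key));

// 2. upload the parts (5 MB is the minimum size for every part except the last)
List<PartETag> partETags = new ArrayList<>();
long partSize = 5L * 1024 * 1024;
long position = 0;
for (int partNumber = 1; position < file.length(); partNumber++) {
    long size = Math.min(partSize, file.length() - position);
    UploadPartRequest partRequest = new UploadPartRequest()
            .withBucketName(bucket)
            .withKey(key)
            .withUploadId(init.getUploadId())
            .withPartNumber(partNumber)
            .withFile(file)
            .withFileOffset(position)
            .withPartSize(size);
    partETags.add(s3.uploadPart(partRequest).getPartETag());
    position += size;
}

// 3. complete the upload so S3 assembles the parts into a single object
s3.completeMultipartUpload(
        new CompleteMultipartUploadRequest(bucket, key, init.getUploadId(), partETags));

Each part upload is an independent HTTP request, so parts can also be uploaded from several threads against the same upload ID.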
Related
I want to save all messages that go into a particular SQS queue to an already created S3 bucket.
But I want to save those messages under certain directories (key prefixes) so they are easier to search by date and time.
S3Client has software.amazon.awssdk.services.s3.model.PutObjectRequest,
where I can set the bucket, the path where the object is saved, and some headers:
PutObjectRequest objectRequest =
PutObjectRequest.builder()
.bucket(bucketName)
.key(s3Path)
.metadata(keyAndMetadata.getMetadata())
.build();
After that, s3Client.putObject(objectRequest, body) does the upload.
Now I want to configure S3 in a similar way using ExtendedClientConfiguration, but I can only see very simple input parameters:
ExtendedClientConfiguration extendedClientConfiguration =
new ExtendedClientConfiguration()
.withPayloadSupportEnabled(s3Client, bucketName, false)
.withAlwaysThroughS3(true);
After that, we create the extended SQS client, with no way to configure S3 more extensively:
AmazonSQSExtendedClient amazonSQSExtendedClient = new AmazonSQSExtendedClient(sqsClient, extendedClientConfiguration);
I know that I could probably save all messages that go to SQS to S3 separately, but I'd rather configure all of that at the client level. Does someone have any ideas?
I found out there's no way to configure the S3 path at the client level. But the S3 offload wasn't created for that purpose, and saving to S3 should probably be handled differently. Letting the library delete files from S3 as they disappear from SQS is the best way to use it.
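If you do end up archiving the messages yourself, a minimal sketch with the v2 SDK could look like this; the bucket name, the date/time prefix format and the .json suffix are my own assumptions:

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;
import software.amazon.awssdk.services.sqs.model.Message;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class MessageArchiver {

    // e.g. "2024/01/31/17-05-42" so objects are easy to browse by date and time
    private static final DateTimeFormatter PREFIX_FORMAT =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH-mm-ss");

    private final S3Client s3Client;
    private final String bucketName;

    public MessageArchiver(S3Client s3Client, String bucketName) {
        this.s3Client = s3Client;
        this.bucketName = bucketName;
    }

    // stores the message body under a date/time prefix, keyed by message id
    public void archive(Message message) {
        String key = LocalDateTime.now().format(PREFIX_FORMAT)
                + "/" + message.messageId() + ".json";
        PutObjectRequest request = PutObjectRequest.builder()
                .bucket(bucketName)
                .key(key)
                .build();
        s3Client.putObject(request, RequestBody.fromString(message.body()));
    }
}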
The existing ways of adding content to an S3 object using methods in the AmazonS3 class are:
putObject with an InputStream
creating a local file with the content and uploading it to S3.
Is there a way to create an OutputStream for an existing S3 object, to which values from a list can be written? I see there are no APIs for doing so.
It's possible to create an S3OutputStream which wraps the AmazonS3 client. See this gist for the implementation:
https://gist.github.com/blagerweij/ad1dbb7ee2fff8bcffd372815ad310eb
It automatically detects large files and uses multipart transfers when required. It uses a byte array for buffering; the size of that buffer depends on your use case (the default is 10 MB).
For example:
final AmazonS3 s3Client = AmazonS3ClientBuilder.defaultClient();
final OutputStream out = new S3OutputStream(s3Client, "bucket", "path/file.ext");
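And, to cover the list-of-values part of the question, a usage sketch (the values and the UTF-8 encoding are assumptions; closing the writer closes the underlying S3OutputStream, which finishes the upload):

final List<String> values = Arrays.asList("first", "second", "third"); // example data
try (PrintWriter writer = new PrintWriter(
        new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
    for (String value : values) {
        writer.println(value);
    }
}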
You can take a look at the AWS SDK guide Performing Operations on an Amazon S3 Object
...
S3Client s3;
...
// Put Object: the v2 S3Client takes a request object plus a RequestBody
// (bucket, key and bytes are placeholders here)
s3.putObject(PutObjectRequest.builder().bucket(bucket).key(key).build(),
        RequestBody.fromBytes(bytes));
I've been trying to extract an .xlsx file from an AWS S3 bucket I created and store it in a MultipartFile variable. I've tried many different approaches, but at best I get weird characters. I'm not finding much documentation on how to do this.
Thanks!
// you may need to initialize this differently to get the correct authorization
final AmazonS3 s3Client = AmazonS3ClientBuilder.defaultClient();
final S3Object object = s3Client.getObject("myBucket", "fileToDownload.xlsx");
// with Java 7 NIO
final Path filePath = Paths.get("localFile.xlsx");
Files.copy(object.getObjectContent(), filePath);
final File localFile = filePath.toFile();
// or Apache Commons IO (pick one approach; the object content stream can only be read once)
final File localFile = new File("localFile.xlsx");
FileUtils.copyToFile(object.getObjectContent(), localFile);
I'm not 100% sure what you mean by "MultipartFile" - that's usually in the context of a file that's been sent to your HTTP web service via a multipart POST or PUT. The file you're getting from S3 is technically part of the response to an HTTP GET request, but the Amazon Java Library abstracts this away for you, and just gives you the results as an InputStream.
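If you really do need a MultipartFile (for example to reuse code that expects one), one option is to wrap the downloaded bytes in Spring's MockMultipartFile from spring-test. A rough sketch, where the bucket, key and content type are assumptions:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.util.IOUtils;
import org.springframework.mock.web.MockMultipartFile;
import org.springframework.web.multipart.MultipartFile;
import java.io.IOException;

public MultipartFile downloadAsMultipartFile() throws IOException {
    AmazonS3 s3Client = AmazonS3ClientBuilder.defaultClient();
    try (S3Object object = s3Client.getObject("myBucket", "fileToDownload.xlsx")) {
        // read the content as raw bytes; treating it as text is what produces weird characters
        byte[] bytes = IOUtils.toByteArray(object.getObjectContent());
        return new MockMultipartFile(
                "file",                   // form field name
                "fileToDownload.xlsx",    // original filename
                "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
                bytes);
    }
}

Note that MockMultipartFile lives in the spring-test artifact, so this pulls a test dependency onto your runtime classpath.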
I'm trying to generate a pre-signed URL a client can use to upload an image to a specific S3 bucket. I've successfully generated requests to GET files, like so:
GeneratePresignedUrlRequest urlRequest = new GeneratePresignedUrlRequest(bucket, filename);
urlRequest.setMethod(method);
urlRequest.setExpiration(expiration);
where expiration and method are Date and HttpMethod objects respectively.
Now I'm trying to create a URL to allow users to PUT a file, but I can't figure out how to set the maximum content-length. I did find information on POST policies, but I'd prefer to use PUT here - I'd also like to avoid constructing the JSON, though that doesn't seem possible.
Lastly, an alternative answer could be some way to pass an image upload from the API Gateway to Lambda so I can upload it from Lambda to S3 after validating file type and size (which isn't ideal).
While I haven't managed to limit the file size on upload, I ended up creating a Lambda function that is triggered on upload to a temporary bucket. The function has a signature like the one below:
public static void checkUpload(S3EventNotification event) {
(this is notable because all the guides I found online refer to an S3Event class that doesn't seem to exist anymore)
The function pulls the file's metadata (not the file itself, as that potentially counts as a large download) and checks the file size. If the size is acceptable, it downloads the file and then uploads it to the destination bucket. If not, it simply deletes the file.
This is far from ideal, as uploads that fail to meet the criteria will seem to work but then simply never show up (S3 issues a 200 status code on upload regardless of what Lambda does afterwards).
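For reference, a condensed sketch of that handler; the bucket name and the 5 MB limit are placeholders of mine, and it uses a server-side copyObject instead of a download followed by a re-upload:

public static void checkUpload(S3EventNotification event) {
    final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    final long maxBytes = 5L * 1024 * 1024;          // placeholder size limit
    final String destinationBucket = "final-bucket"; // placeholder name

    for (S3EventNotification.S3EventNotificationRecord record : event.getRecords()) {
        String bucket = record.getS3().getBucket().getName();
        // keys with special characters arrive URL-encoded in event notifications
        String key = record.getS3().getObject().getKey();

        // metadata only (a HEAD request), so the object body is never downloaded here
        ObjectMetadata metadata = s3.getObjectMetadata(bucket, key);
        if (metadata.getContentLength() <= maxBytes) {
            s3.copyObject(bucket, key, destinationBucket, key);
        }
        // remove the object from the temporary bucket either way
        s3.deleteObject(bucket, key);
    }
}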
This is effectively a workaround rather than a solution, so I won't be accepting this answer.
I read that the GCS Storage REST API supports 3 upload methods:
simple HTTP upload
chunked upload
resumable upload
I see that google-api-services-storage-v1 uses the resumable upload approach,
but I am curious how to change this, because a resumable upload wastes
2 HTTP requests: one for the metadata and a second for the data.
The request body of the first request is just {"name": "xxx"}.
InputStreamContent contentStream = new InputStreamContent(
APPLICATION_OCTET_STREAM, stream);
StorageObject objectMetadata = new StorageObject()
.setName(id.getValue());
Storage.Objects.Insert insertRequest = storage.objects().insert(
bucketName, objectMetadata, contentStream);
StorageObject object = insertRequest.execute();
I believe that particular library exclusively uses resumable uploads. Resumable uploads are very useful for large transfers, as they can recover from error and continue the upload. This is indeed sub-optimal in some cases, such as if you wanted to upload a very large number of very small objects one at a time.
If you want to perform simpler uploads, you might want to consider another library, such as gcloud-java, which can perform direct uploads like so:
Storage storage = StorageOptions.defaultInstance().service();
Bucket bucket = storage.get(bucketName);
bucket.create(objectName, /*byte[] or InputStream*/, contentType);
That'll use only one request, although for larger uploads I recommend sticking with resumable uploads.
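As a usage sketch (the bucket and object names and the payload are placeholders, and gcloud-java has since been renamed google-cloud-storage, so accessor names may differ slightly between versions):

Storage storage = StorageOptions.defaultInstance().service();
Bucket bucket = storage.get("my-bucket");

// single-request ("direct") upload of a small in-memory payload
byte[] content = "hello world".getBytes(StandardCharsets.UTF_8);
Blob blob = bucket.create("greetings/hello.txt", content, "text/plain");

bucket.create returns a Blob handle you can keep using for metadata lookups or deletion.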