AWS S3 Java SDK: When does the file actually begin download


When does the content of the file I am retrieving actually begin downloading from S3?
AmazonS3Client.getObject()
S3Object.getContent()
S3ObjectInputStream.read()
From what I can tell, it's the first one, but I haven't found an explicit answer in the docs.

It depends on how you define "begin downloading."
Technically, it's getObject(). Otherwise it wouldn't throw the exceptions it throws, some of which necessarily require that the service has been contacted and the download initiated (or failed or denied).
The Javadoc for getObject() also warns: "Be extremely careful when using this method; the returned Amazon S3 object contains a direct stream of data from the HTTP connection."
https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/AmazonS3Client.html#getObject-com.amazonaws.services.s3.model.GetObjectRequest-
The download must have already started for the stream to be available. Bytes are waiting, but you are sitting on a stream that you haven't read from beyond the response headers, so the download is stalled from proceeding further until you start reading.
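The distinction between "response headers received" and "body consumed" can be illustrated without S3 at all. The following is only a sketch that uses the JDK's built-in HttpServer and HttpURLConnection as a stand-in for the S3 service and the SDK; the class name LazyBodyDemo and the /object path are made up for illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class LazyBodyDemo {
    /** Serves payload from a throwaway local server and returns {status, body}. */
    public static String[] fetch(String payload) {
        byte[] bytes = payload.getBytes(StandardCharsets.UTF_8);
        try {
            // Port 0 asks the OS for any free port (stand-in for the remote service).
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/object", exchange -> {
                exchange.sendResponseHeaders(200, bytes.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(bytes);
                }
            });
            server.start();
            try {
                URL url = URI.create("http://127.0.0.1:"
                        + server.getAddress().getPort() + "/object").toURL();
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                // Comparable to getObject(): the request is sent and the status line
                // and headers are parsed, but the caller has not read any body bytes.
                int status = conn.getResponseCode();
                // Comparable to reading the S3ObjectInputStream: only now does the
                // caller pull the body off the connection.
                try (InputStream in = conn.getInputStream()) {
                    String body = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                    return new String[] { Integer.toString(status), body };
                }
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling LazyBodyDemo.fetch("hello") returns the status code and the body as two separate steps of the same connection, which mirrors the getObject() followed by read() sequence discussed above.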

Related

Reading and processing data from an S3 stream

I’d like to read data from a large file (on the order of gigabytes) in S3 and process it on the fly (as opposed to loading the entire file into memory or caching it locally). In some cases, the processing may be lengthy and could potentially “stall” the reading process for several minutes or longer. That is, the connection used to stream the data may become idle for several minutes or more. Below is a contrived example that demonstrates this:
InputStream readStream = s3Client.getObject(GetObjectRequest.builder().bucket(bucketLocation).key(fileLocation).build());
readStream.readNBytes(100);
Thread.sleep(600000); // Wait for 10 mins
readStream.readAllBytes(); // Throws SocketException
In this example, the second read attempt will throw a java.net.SocketException: Connection reset error.
I’ve made several attempts to configure the HttpClient to keep the connection open, including the following configuration:
S3Client s3Client = S3Client.builder()
.httpClient(
ApacheHttpClient.builder()
.maxConnections(100)
.tcpKeepAlive(Boolean.TRUE)
.connectionTimeToLive(Duration.ofHours(1))
.connectionMaxIdleTime(Duration.ofHours(1))
.socketTimeout(Duration.ofHours(1))
.connectionTimeout(Duration.ofHours(1))
.build())
.region(region)
.credentialsProvider(awsCredentials)
.build();
Unfortunately, none of these settings seem to have any impact in resolving this particular problem. Is there anything else I’m missing here? Or is this design inherently flawed?

Look at using the Amazon S3 Transfer Manager, instead of the standard client, to work with larger Amazon S3 objects. The client object is S3TransferManager.
The Amazon S3 Transfer Manager is an open source, high level file transfer utility for the AWS SDK for Java 2.x. Use it to transfer files and directories to and from Amazon S3.
See the documentation, including examples, in the AWS SDK Java V2 Developer Guide.
Amazon S3 Transfer Manager
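Separately from the Transfer Manager suggestion, one pattern sometimes used when processing stalls between reads is to fetch the object in byte ranges, so that each chunk uses a short-lived request and no single connection has to sit idle. The sketch below only plans the Range header values; the class name RangePlanner is hypothetical. It assumes the object size is obtained beforehand (for example via headObject(...).contentLength() in SDK 2.x) and that each value is passed to GetObjectRequest.builder().range(...).

```java
import java.util.ArrayList;
import java.util.List;

public class RangePlanner {
    /**
     * Returns HTTP Range header values that cover bytes [0, objectSize)
     * in consecutive chunks of at most chunkSize bytes.
     */
    public static List<String> ranges(long objectSize, long chunkSize) {
        if (objectSize < 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("objectSize must be >= 0 and chunkSize > 0");
        }
        List<String> out = new ArrayList<>();
        for (long start = 0; start < objectSize; start += chunkSize) {
            // HTTP byte ranges use an inclusive end offset.
            long end = Math.min(start + chunkSize, objectSize) - 1;
            out.add("bytes=" + start + "-" + end);
        }
        return out;
    }
}
```

For a 250-byte object and 100-byte chunks this yields bytes=0-99, bytes=100-199 and bytes=200-249. Each chunk can be read fully, the stream closed, and the lengthy processing done before the next range is requested.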

How to download files over 6MB

I have the most basic problem ever. The user wants to export some data which is around 20-70k records and can take from 20-40 seconds to execute and the file can be around 5-15MB.
Currently my code is as such:
User clicks a button which makes an API call to a Java Lambda
AWS Lambda Handler calls a method to get the data from DB and generate excel file using Apache POI
Set Response Headers and send the file as XLSX in the response body
I am now faced with two bottlenecks:
API Gateway times out after 29 seconds; if the file takes longer to generate, the request fails and the user gets a 504 in the browser.
The response from Lambda can be at most 6MB; if the file is bigger, the user gets a 413/502 in the browser.
What should be my approach to just download A GENERATED RUNTIME file (not pre-built in s3) using AWS?

If you want to keep it simple (no additional queues or async processing) this is what I'd recommend to overcome the two limitations you describe:
Use the new AWS Lambda Endpoints. Since that option doesn't use the AWS API Gateway, you shouldn't be restricted to the 29-sec timeout (not 100% sure about this).
Write the file to S3, then get a temporary presigned URL to the file and return a redirect (HTTP 302) to the client. This way you won't be restricted to the 6MB response size.
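The redirect step might look roughly like the following. This is a sketch only: it assumes the handler returns a proxy-style response map with statusCode, headers and body keys, and that the presigned URL has already been generated elsewhere (the presignedUrl parameter and the class name RedirectResponse are placeholders).

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RedirectResponse {
    /**
     * Builds a response that redirects the browser to the given URL
     * instead of carrying the file in the response body.
     */
    public static Map<String, Object> to(String presignedUrl) {
        Map<String, Object> response = new LinkedHashMap<>();
        response.put("statusCode", 302);
        // The browser follows Location and downloads directly from S3.
        response.put("headers", Map.of("Location", presignedUrl));
        response.put("body", "");
        return response;
    }
}
```

Because the response body stays empty, the size of the generated file no longer counts against the Lambda response limit.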

Here are the possible options for you.
Use JavaScript to the rescue. Accept the request from the browser/client and immediately respond from the server that file preparation is in progress. Meanwhile, continue preparing the file in the background (as a separate job). From JavaScript, keep polling the status of the file with a separate request. Once the file is ready, return it.
Smarter front-end clients use web-sockets to solve such problems.
If the DB query is the culprit, cache the data on the server side, if that is possible for you.
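The server side of the polling option could be sketched as a small job registry: one call starts the work and returns an id, another reports the status. This is an in-memory illustration with hypothetical names (ExportJobs, submit, status, result). In a Lambda setup the background work would have to run as a separate invocation or job, since an in-process thread is not a reliable place to continue after the response is sent.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ExportJobs {
    // Daemon worker thread so an idle registry never keeps the JVM alive.
    private final ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "export-worker");
        t.setDaemon(true);
        return t;
    });
    private final Map<String, Future<byte[]>> jobs = new ConcurrentHashMap<>();

    /** Starts the export in the background and returns an id the client can poll. */
    public String submit(Callable<byte[]> task) {
        String id = UUID.randomUUID().toString();
        jobs.put(id, worker.submit(task));
        return id;
    }

    /** Status string for the polling endpoint. */
    public String status(String id) {
        Future<byte[]> job = jobs.get(id);
        if (job == null) {
            return "UNKNOWN";
        }
        return job.isDone() ? "READY" : "IN_PROGRESS";
    }

    /** Blocks until the file bytes are available; intended to be called once status is READY. */
    public byte[] result(String id) {
        try {
            return jobs.get(id).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    public void shutdown() {
        worker.shutdown();
    }
}
```

The client would call the submit endpoint once, then poll the status endpoint until it reports READY, and only then request the file.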

When your script takes more than 30s to run on your server, implement queues. This tutorial shows how to implement queues using SQS, though any other queueing service would work:
https://mikecroft.io/2018/04/09/use-aws-lambda-to-send-to-sqs.html
Once you implement queues, your timeout issue will be solved because you are now fetching your large data set in a background thread on your server.
Once the Excel file is ready in the background, save it in your S3 bucket or on your server's hard disk and create a download link for your user.
Once the download link is created, send it to your user via email. For this you need to have the user's email address.
So, in summary: apply a queue -> send an email with the downloadable file.

Instead of some sophisticated solution (though that would be interesting):
Take inventory. Split the Excel export into portions of, say, 10k rows, and calculate the number of documents.
Each Excel generation call then has a reduced workload.
Whether to deliver via e-mail, a page with links, or a queue is up to you.
The advantage is staying below e-mail size limits and response time-outs, and avoiding denial of service.
(In Excel one could also create a master document, but I have no experience with that.)
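The inventory step is simple arithmetic. A sketch with hypothetical names (ExportParts, partCount, partBounds), assuming rows are numbered from 0:

```java
public class ExportParts {
    /** Number of documents needed for totalRows rows at rowsPerPart rows each (rounded up). */
    public static int partCount(long totalRows, int rowsPerPart) {
        if (totalRows < 0 || rowsPerPart <= 0) {
            throw new IllegalArgumentException("totalRows must be >= 0 and rowsPerPart > 0");
        }
        return (int) ((totalRows + rowsPerPart - 1) / rowsPerPart);
    }

    /** Returns {firstRow, endRowExclusive} for the part with the given zero-based index. */
    public static long[] partBounds(long totalRows, int rowsPerPart, int partIndex) {
        if (partIndex < 0 || partIndex >= partCount(totalRows, rowsPerPart)) {
            throw new IllegalArgumentException("partIndex out of range");
        }
        long first = (long) partIndex * rowsPerPart;
        long end = Math.min(first + rowsPerPart, totalRows);
        return new long[] { first, end };
    }
}
```

With 70,000 records and 10,000 rows per document, partCount gives 7; each generation call then only has to query and write the rows inside its own bounds.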

Reading end of huge and dynamic file via SFTP from server

I am trying to find a way to read just the end of a huge, constantly changing log file (about 20-30 lines from the end) via SFTP from a server, and to save the point up to which I have read so that, if I need more lines, I can read further up from that point.
Everything I've tried takes too long. I tried copying the file to my machine and then reading from the end using ReversedLinesFileReader, because that class needs a File object whereas SFTP only gives you an InputStream, but downloading the file takes a long time.
I also tried counting lines and reading from line n, but that also takes too long and throws an exception because the file is sometimes modified in the meantime. I also tried connecting via SSH and running tail -100, which gives the desired result, but only once: the next time I also get new log lines, whereas I need to go further up. Is there a fast way to get the end of the file, save that point, and later read more from above that point? Any ideas?

You don't say what SFTP library you're using, but the most widely used Java SSH/SFTP library is JSch, so I'll assume you're using that.
The SFTP protocol has operations to perform random-access I/O on remote files. Unfortunately, the JSch SFTP client doesn't expose the full range of operations. However, it does have versions of the get operation (for getting a file from the remote server) which permit skipping over the first part of the remote file. You can use one of these operations to read for example the last 10 KB of a file.
Several of the JSch get operations return an InputStream. You can read the contents of the remote file from the input stream. If you want to access the remote file line-by-line, you can convert it to Reader using InputStreamReader.
So, a process might do the following:
Call stat() on the remote file to get its size.
Figure out where in the file you want to start reading from. You could keep track of where you stopped reading last time, or you could guess based on the amount of data you're willing to download and the expected size in bytes of these last 20-30 lines.
Call get() to start reading it.
Process data read from the InputStream returned by the get() call.
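The offset arithmetic and line handling in the steps above can be sketched independently of the SFTP library. The code below works on any InputStream; the class and method names (RemoteTail, startOffset, lastLines) are hypothetical. With JSch, the assumption is that the file size would come from ChannelSftp.stat(path).getSize() and the stream from the get(src, monitor, skip) overload, with skip set to the computed offset.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class RemoteTail {
    /** Offset to start reading from so that at most windowBytes are downloaded. */
    public static long startOffset(long fileSize, long windowBytes) {
        return Math.max(0, fileSize - windowBytes);
    }

    /**
     * Reads the stream to its end and returns up to maxLines trailing lines.
     * If the stream starts mid-file, the first line is discarded because it is
     * probably only the tail of a longer line.
     */
    public static List<String> lastLines(InputStream in, boolean startsMidFile, int maxLines) {
        if (maxLines <= 0) {
            return new ArrayList<>();
        }
        Deque<String> window = new ArrayDeque<>();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            if (startsMidFile) {
                reader.readLine(); // drop the (likely partial) first line
            }
            String line;
            while ((line = reader.readLine()) != null) {
                if (window.size() == maxLines) {
                    window.removeFirst(); // keep only the newest maxLines lines
                }
                window.addLast(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new ArrayList<>(window);
    }
}
```

To read further up later, remember the offset that was used and request the window that ends there on the next call. Because the log only grows at the end, byte offsets of older content stay valid as long as the file is not rotated or truncated.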

The best option would be to use rotating log files, possibly with compression.
However, rsync is a unidirectional synchronisation tool that can transmit only the changed parts of a file: for a log, the new end.
I am not sure whether it performs well enough in your case, and ssh is a prerequisite.

Throttle speed at which a servlet accepts an HTTP Post Body under Tomcat

I have a servlet that accepts large (up to 4GB) binary file uploads. The submitted file is transmitted as the body of an HTTP POST.
The servlet has to perform some time-consuming processing as it receives the file, and it has to finish doing that before sending the response. As a result, it can appear to a fast client that the server has hung because the client can be waiting for a minute or two after sending the last few bytes before getting the response.
Is there a way either within Tomcat or within the servlet API to throttle back the speed at which the server accepts the file? I would like it to appear to the client that the server is accepting the file at (for example) 10MB/second rather than it accepting the file at 50MB/second and then taking a few minutes after receiving the body to return a response.
Thanks.

I'm extending on the comment of Mark Thomas here because I feel that this is worth being an answer (or the answer), rather than a comment. Mark, let me know if you want to convert the comment yourself and I'll happily delete mine.
John, you're trying to solve your problem in a way that imposes severe limitations: What's the bandwidth that you want to throttle to? What happens when the server is upgraded to a beefier CPU and can process more quickly? What if multiple uploads happen at the same time?
You probably want to have an upload of 4G in as quick a time as possible - imagine the connection going down in the middle - in a web application this typically means you'll have to restart the upload from the beginning. Thus you should decouple your processing from the upload procedure as much as possible.
You also don't mention the file format that gets uploaded: If it happens to be a zip file, note that the server can't do anything with the file until it's fully transmitted, as zip files have the directory of contents at their end. (this might be old knowledge, but at least the old spec had it this way. Someone correct me if this changed)
So: The proper way: Accept the file for processing, signal that you received it and are processing. If you like: Implement Ajax updates once you're done. In the simplest case: "click here to see if processing finished" or frequently reload the page. Anything works and everything is better than throttling throughput on this layer.

Cannot delete GAE file

I'm trying to remove a file after a broken upload using
final FileService fileService = FileServiceFactory.getFileService();
fileService.delete(file);
But I get:
java.lang.UnsupportedOperationException: File \/blobstore\/writable:AD8BvukH[...]qau-Bb7AD does not have a finalized name
When I try to finalize the file with
FileWriteChannel writeChannel = fileService.openWriteChannel(file, true);
writeChannel.closeFinally();
then openWriteChannel() fails with
com.google.appengine.api.files.FinalizationException
[...]
Caused by: com.google.apphosting.api.ApiProxy$ApplicationException: ApplicationError: 101:
What does ApplicationError 101 mean?
How can I properly delete the file?

It looks like others have reported this problem and, although it was addressed, there could still be a problem with broken files.
Sep 11, 2013 at 1:14 am
"We have now fixed this issue from reoccurring in future. However, there are some blobs created in the past that still give errors. We are working on a fix for these blobs."
John Lowry, on behalf of the App Engine team
http://grokbase.com/t/gg/google-appengine/138xrawqw0/broken-blobstore-files-what-to-do
UnsupportedOperationException
For the first error, the documentation states:
java.lang.UnsupportedOperationException - if a file's type is not supported by delete or file does not have a finalized name.
It could be that the file is already finalized, and you can't delete it for some other reason.
ApplicationError: 101
I think the second error refers to a not found exception.
FinalizationError: ApplicationError: 101 Blobkey not found.
This may clarify the issue for you.
"You only use finalize if you create a file and write to it. But you cannot write to a file, after it has been finalized. To update a file in the blobstore, you always have to create a new one. And when you read a file, you do not have to finalize it. To read a file you have to use a blobreader. See: https://developers.google.com/appengine/docs/python/blobstore/blobreaderclass"
via https://stackoverflow.com/a/12855653/1085891
Fixing the Broken Upload
You could resume the upload.
If the transfer is interrupted, you can resume the transfer from where it left off using the --db_filename=... argument.
via How to finish a broken data upload to the production Google App Engine server?
Additional Solutions / Information:
Cannot delete entity with broken id from datastore
Handle Form Failure when uploading to Appengine Blobstore
Issue 4744: Java dev server fails at deleting blobs.
