Reading and processing data from an S3 stream - java

I’d like to read data from a large file (on the order of GBs) in S3 and process it on the fly (as opposed to loading the entire file into memory or caching it locally). In some cases the processing may be lengthy and could potentially “stall” the reading process for several minutes or longer. That is, the connection used to stream the data may become idle for several minutes or more. Below is a contrived example that demonstrates this:
InputStream readStream = s3Client.getObject(
        GetObjectRequest.builder().bucket(bucketLocation).key(fileLocation).build());
readStream.readNBytes(100);
Thread.sleep(600000); // Wait for 10 mins
readStream.readAllBytes(); // Throws SocketException
In this example, the second read attempt will throw a java.net.SocketException: Connection reset error.
I’ve made several attempts to configure the HttpClient to keep the connection open, including the following configuration:
S3Client s3Client = S3Client.builder()
        .httpClient(ApacheHttpClient.builder()
                .maxConnections(100)
                .tcpKeepAlive(Boolean.TRUE)
                .connectionTimeToLive(Duration.ofHours(1))
                .connectionMaxIdleTime(Duration.ofHours(1))
                .socketTimeout(Duration.ofHours(1))
                .connectionTimeout(Duration.ofHours(1))
                .build())
        .region(region)
        .credentialsProvider(awsCredentials)
        .build();
Unfortunately, none of these settings seems to have any effect on this particular problem. Is there anything else I’m missing here? Or is this design inherently flawed?

Look at using the Amazon S3 Transfer Manager instead of the standard client to work with larger Amazon S3 objects. The client object is S3TransferManager.
The Amazon S3 Transfer Manager is an open-source, high-level file transfer utility for the AWS SDK for Java 2.x. Use it to transfer files and directories to and from Amazon S3.
See the documentation, including examples, in the AWS SDK for Java 2.x Developer Guide.
Amazon S3 Transfer Manager
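A minimal sketch of a Transfer Manager download with the 2.x SDK, reusing the bucket/key variables from the question; the local destination path is a placeholder, and note that Transfer Manager writes to a destination file rather than handing you a raw stream:
import java.nio.file.Paths;
import software.amazon.awssdk.transfer.s3.S3TransferManager;
import software.amazon.awssdk.transfer.s3.model.CompletedFileDownload;
import software.amazon.awssdk.transfer.s3.model.DownloadFileRequest;
import software.amazon.awssdk.transfer.s3.model.FileDownload;

S3TransferManager transferManager = S3TransferManager.create();

DownloadFileRequest downloadRequest = DownloadFileRequest.builder()
        .getObjectRequest(req -> req.bucket(bucketLocation).key(fileLocation))
        .destination(Paths.get("/tmp/large-file")) // placeholder local path
        .build();

FileDownload download = transferManager.downloadFile(downloadRequest);
CompletedFileDownload completed = download.completionFuture().join(); // blocks until the transfer finishes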

Related

How to download files over 6MB

I have the most basic problem ever. The user wants to export some data, which is around 20-70k records; the export can take 20-40 seconds to execute and the file can be around 5-15 MB.
Currently my flow is as follows:
1. The user clicks a button, which makes an API call to a Java Lambda.
2. The AWS Lambda handler calls a method to get the data from the DB and generates an Excel file using Apache POI.
3. The handler sets response headers and sends the file as XLSX in the response body.
I am now faced with two bottlenecks:
1. API Gateway times out after 29 seconds; if the file takes longer to generate, the request fails and the user gets a 504 in the browser.
2. The response from a Lambda can only be 6 MB; if the file is bigger, the user gets a 413/502 in the browser.
What should my approach be to download a file that is generated at runtime (not pre-built in S3) using AWS?
If you want to keep it simple (no additional queues or async processing), this is what I'd recommend to overcome the two limitations you describe:
1. Use the new Lambda function URLs. Since that option doesn't go through API Gateway, you shouldn't be restricted to the 29-second timeout (not 100% sure about this).
2. Write the file to S3, then get a temporary presigned URL to the file and return a redirect (HTTP 302) to the client. This way you won't be restricted by the 6 MB response size limit.
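For option 2, generating the presigned URL with the AWS SDK for Java 2.x looks roughly like this (the bucket, key, and 15-minute expiry are placeholders, not values from the question):
import java.time.Duration;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.presigner.S3Presigner;
import software.amazon.awssdk.services.s3.presigner.model.GetObjectPresignRequest;
import software.amazon.awssdk.services.s3.presigner.model.PresignedGetObjectRequest;

S3Presigner presigner = S3Presigner.create();

GetObjectPresignRequest presignRequest = GetObjectPresignRequest.builder()
        .signatureDuration(Duration.ofMinutes(15)) // how long the download link stays valid
        .getObjectRequest(GetObjectRequest.builder()
                .bucket("export-bucket")           // placeholder bucket
                .key("exports/report.xlsx")        // placeholder key
                .build())
        .build();

PresignedGetObjectRequest presigned = presigner.presignGetObject(presignRequest);
String location = presigned.url().toString(); // return this in the 302 Location header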
Here are some possible options for you.
Use JavaScript to the rescue. Accept the request from the browser/client and immediately respond from the server that file preparation is in progress. Meanwhile, continue preparing the file in the background (as a separate job). From JavaScript, keep polling the status of the file with a separate request, and return it once it is ready.
Smarter front-end clients use WebSockets to solve such problems.
If the DB query is the culprit, cache the data on the server side, if that is possible for you.
When your script takes more than 30 s to run on your server, you implement queues. This tutorial shows how to implement queues using SQS or any other service:
https://mikecroft.io/2018/04/09/use-aws-lambda-to-send-to-sqs.html
Once you implement queues, your timeout issue is solved, because the big data records are now fetched by a background worker on your server.
Once the Excel file is ready in the background, save it in your S3 bucket (or on your server's disk) and create a downloadable link for your user.
Once the download link is created, send it to your user via email, which means you need the user's email address.
So the summary is: apply a queue -> send a mail with the downloadable file.
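The producer side of that queue is only a few lines; a minimal sketch with the AWS SDK for Java 2.x, where the queue URL and message body are placeholders and a separate worker is assumed to consume the queue, build the XLSX with Apache POI, upload it to S3, and email the link:
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.SendMessageRequest;

SqsClient sqs = SqsClient.create();

// Enqueue the export job instead of building the file inside the 29-second API Gateway window
sqs.sendMessage(SendMessageRequest.builder()
        .queueUrl("https://sqs.us-east-1.amazonaws.com/123456789012/export-jobs") // placeholder queue URL
        .messageBody("{\"userId\":42,\"report\":\"sales\"}")                      // placeholder job payload
        .build());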
Instead of some sophisticated solution (though that would be interesting), a simpler alternative:
Split the export into portions of, say, 10k rows each and calculate the number of documents up front.
Each Excel generation call then has a reduced workload.
Whether you deliver them by e-mail, a page with links, or a queue is up to you.
The advantage is staying below e-mail size limits, response timeouts, and denial-of-service thresholds.
(In Excel one could also create a master document linking the parts, but I have no experience with that.)
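A minimal sketch of the chunking arithmetic, where totalRows stands in for the actual record count:
int totalRows = 70_000;                                       // placeholder record count
int rowsPerFile = 10_000;                                     // chunk size suggested above
int fileCount = (totalRows + rowsPerFile - 1) / rowsPerFile;  // ceiling division -> 7 files here

for (int i = 0; i < fileCount; i++) {
    int firstRow = i * rowsPerFile;
    int lastRow = Math.min(firstRow + rowsPerFile, totalRows) - 1;
    // generate one workbook covering rows [firstRow, lastRow] and produce a link for it
}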

Best strategy to upload files with unknown size to S3

I have a server-side application that runs through a large number of image URLs and uploads the images from these URLs to S3.
The files are served over HTTP. I download them using the InputStream I get from an HttpURLConnection via its getInputStream method. I hand that InputStream to the AWS S3 client's putObject method (AWS SDK for Java v1) to upload the stream to S3. So far so good.
I am trying to introduce a new external image data source. The problem with this data source is that the HTTP server serving these images does not return a Content-Length header. This means I cannot tell how many bytes the image will be, which is a number the AWS S3 client requires to validate that the image was correctly uploaded from the stream to S3.
The only ways I can think of to deal with this issue are either to get the server owner to add a Content-Length header to their response (unlikely), or to download the file to a memory buffer first and then upload it to S3 from there.
These are not big files, but I have many of them.
When considering downloading the file first, I am worried about the memory footprint and concurrency implications (not being able to upload and download chunks of the same file at the same time).
Since I am dealing with many small files, I suspect that the concurrency issues might be "resolved" if I focus on concurrency across multiple files instead of within a single file. So instead of concurrently downloading and uploading chunks of the same file, I can use my I/O effectively by downloading one file while uploading another.
I would love your ideas on how to do this, best practices, pitfalls or any other thought on how to best tackle this issue.
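To make the memory-buffer option concrete, here is a rough sketch with the v1 classes mentioned above; the URL, bucket, and key are placeholders, and s3Client is assumed to be the existing AmazonS3 client:
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.net.URL;
import com.amazonaws.services.s3.model.ObjectMetadata;

byte[] body;
try (InputStream in = new URL("https://example.com/image.jpg").openStream()) {
    body = in.readAllBytes(); // acceptable because the images are small
}

ObjectMetadata metadata = new ObjectMetadata();
metadata.setContentLength(body.length); // length is now known, so the SDK can validate the upload

s3Client.putObject("my-bucket", "images/image.jpg", new ByteArrayInputStream(body), metadata);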

Using AWS S3 as an intermediate storage layer for monitoring platform

We have a use case where we want to use S3 to push event-based and product metrics temporarily until they are loaded into a relational data warehouse (Oracle). These metrics would be sent by more than 200 application servers to S3 and persisted in different files per metric per server. The frequency of some of the metrics could be high, e.g. sending the number of active HTTP sessions or the memory usage of the app server every minute. Once the metrics are persisted in S3, we would have something on the data warehouse side that reads the CSV files and loads them into Oracle. We thought of using S3 over a queue (Kafka/ActiveMQ/RabbitMQ) due to various factors, including cost, durability, and replication. I have a few questions related to the write and read mechanisms with S3:
For event-based metrics, how can we write to S3 such that the app server is not blocked? I see that the Java SDK supports asynchronous writes. Would that guarantee delivery?
How can we update a CSV file created on S3 by appending a record? From what I have read, we cannot update an S3 object in place. What would be an efficient way to push monitoring metrics to S3 at periodic intervals?
When reading from S3, performance isn't a critical requirement. What would be an optimized way of loading the CSV files into Oracle? A couple of options are using the GetObject API from the Java SDK, or mounting S3 folders as NFS shares and creating external tables. Are there any other efficient ways of reading?
Thanks
FYI, 200 servers sending one request per minute is not "high". You are likely over-engineering this. SQS is simple, highly redundant/available, and would likely meet your needs far better than rolling your own solution.
To answer your questions in detail:
1) No, you cannot "guarantee delivery", especially with asynchronous S3 operations. You could design recoverable operations, but not guaranteed delivery.
2) That isn't what S3 is for: it only does whole-object writes. To "append" you would have to create a system that adds lots of small files, and you probably don't want to do that. Updating a file (especially from multiple threads) is dangerous, since each update replaces the entire object.
3) If you must do this, use the object API, process each file one at a time, and delete the files when you are done. You are much better off building a queue-based system.
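A rough sketch of that read-and-delete loop with the AWS SDK for Java 2.x, where the bucket name, prefix, and local paths are placeholders:
import java.nio.file.Paths;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.DeleteObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
import software.amazon.awssdk.services.s3.model.S3Object;

S3Client s3 = S3Client.create();

for (S3Object obj : s3.listObjectsV2Paginator(
        ListObjectsV2Request.builder().bucket("metrics-bucket").prefix("csv/").build()).contents()) {

    // Download one CSV at a time to a local file so it can be loaded into Oracle
    s3.getObject(GetObjectRequest.builder().bucket("metrics-bucket").key(obj.key()).build(),
            Paths.get("/tmp", Paths.get(obj.key()).getFileName().toString()));

    // ... load the local file into Oracle here ...

    s3.deleteObject(DeleteObjectRequest.builder().bucket("metrics-bucket").key(obj.key()).build());
}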

AWS S3 Java SDK: When does the file actually begin download

When does the content of the file I am retrieving actually begin downloading from S3?
AmazonS3Client.getObject()
S3Object.getObjectContent()
S3ObjectInputStream.read()
From what I can tell, it's the first one, but I haven't found the answer stated explicitly in the docs.
It depends on how you define "begin downloading."
Technically, it's getObject(). Otherwise it wouldn't throw the exceptions it throws, some of which necessarily require that the service has been contacted and the download initiated (or failed or denied).
The getObject() Javadoc warns: "Be extremely careful when using this method; the returned Amazon S3 object contains a direct stream of data from the HTTP connection."
https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/AmazonS3Client.html#getObject-com.amazonaws.services.s3.model.GetObjectRequest-
The download must already have started for the stream to be available: bytes are waiting on the connection, but you are sitting on a stream you haven't read from beyond the response headers, so the download is stalled from proceeding further until you start reading.
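To make that concrete, a rough sketch with the v1 classes from the question (bucket and key are placeholders); the comments mark where network activity actually happens:
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.services.s3.model.S3ObjectInputStream;

AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

// Network round trip happens here: the request is sent and the response headers are read,
// which is why access-denied or not-found errors surface on this call.
S3Object object = s3.getObject(new GetObjectRequest("my-bucket", "my-key"));

// No additional request: this just exposes the stream tied to the still-open HTTP connection.
S3ObjectInputStream content = object.getObjectContent();

// The body is pulled off the connection only as you read; until then the transfer
// sits behind TCP flow control.
byte[] buffer = new byte[8192];
int bytesRead = content.read(buffer);
content.close();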

Rate Limit s3 Upload using Java API

I am using the AWS SDK for Java to transfer large files to S3. Currently I am using the upload method of the TransferManager class to enable multipart uploads. I am looking for a way to throttle the rate at which these files are transferred, to make sure I don't disrupt other services running on this CentOS server. Is there something I am missing in the API, or is there some other way to achieve this?
Without support for this in the API, one approach is to wrap the process doing the S3 transfer with trickle, a userspace bandwidth shaper (something like trickle -u 512 java -jar your-uploader.jar to cap upstream bandwidth at roughly 512 KB/s).
