How to get recently added files from Amazon S3 - java

Is there any way to get the newly added files from Amazon S3 (for example, by sorting on timestamp)? My current approach is to go through each file and compare its creation timestamp, which is very slow when I have lots of files in the bucket.
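For reference, the approach described above can be sketched as a single pass over the bucket listing that keeps only objects modified after a checkpoint and returns them newest first. This is a minimal sketch under assumptions: ObjectInfo and RecentObjects are hypothetical names, and the in-memory list stands in for the key and last-modified pairs that a bucket listing returns. Listings come back ordered by key rather than by time, so the filtering and sorting happen on the client.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for one entry of a bucket listing (key + last-modified time).
final class ObjectInfo {
    final String key;
    final Instant lastModified;

    ObjectInfo(String key, Instant lastModified) {
        this.key = key;
        this.lastModified = lastModified;
    }
}

final class RecentObjects {
    // Returns the keys of objects modified strictly after the checkpoint, newest first.
    static List<String> modifiedAfter(List<ObjectInfo> listing, Instant checkpoint) {
        List<ObjectInfo> recent = new ArrayList<>();
        for (ObjectInfo o : listing) {
            if (o.lastModified.isAfter(checkpoint)) {
                recent.add(o);
            }
        }
        recent.sort(Comparator.comparing((ObjectInfo o) -> o.lastModified).reversed());
        List<String> keys = new ArrayList<>();
        for (ObjectInfo o : recent) {
            keys.add(o.key);
        }
        return keys;
    }

    public static void main(String[] args) {
        // Made-up listing data for illustration only.
        List<ObjectInfo> listing = Arrays.asList(
                new ObjectInfo("a.csv", Instant.parse("2020-01-01T00:00:00Z")),
                new ObjectInfo("b.csv", Instant.parse("2020-01-03T00:00:00Z")),
                new ObjectInfo("c.csv", Instant.parse("2020-01-02T00:00:00Z")));
        // Prints [b.csv, c.csv]
        System.out.println(modifiedAfter(listing, Instant.parse("2020-01-01T12:00:00Z")));
    }
}
```

Storing the newest timestamp seen as the next checkpoint avoids re-comparing old objects on the client, although the listing itself still has to be walked.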

Related

How Can I read the metadata of files in S3 buckets using spark?

I am slightly new to AWS as well as Spark. I am stuck with a problem. I have a folder in my S3 bucket which contains 2 files named financial_data1.csv and financial_data2.csv. I am trying to read the records from both files and do an upsert.
To run the upsert I need a timestamp for each record so I can tell which record is the latest. My CSV files do not contain a created_timestamp/updated_timestamp column at the record level, so I have to depend on the last modified time of the file in the S3 bucket, which is simply the time the file was uploaded.
Can I read this upload time from my spark-scala/spark-java code?
There are currently 2 solutions I have in mind:
Run a lambda to rename the files and add the timestamp to the filename.
Read all the files from java/scala code and use the summary object to fetch the last modified date.
I can do either of the above, but both add the overhead of calling S3 to fetch the files through java/scala code first and then reading the data through Spark.
What I want is to read the file content together with its file name and last modified date directly. Is this possible?

Overwrite a file on S3 bucket

I am developing a feature where we need to back up our files to an S3 bucket with the key pattern "tmp/yyyy-mm-dd.file_type.fileName".
If I run my app today to back up the file "abc.txt", it will be stored under a key that follows this pattern.
Let's say tomorrow "abc.txt" is updated and the updated file needs to be backed up to S3. It will then be pushed with a different timestamp but with the same fileName in the key.
What should be done so that there is no redundancy in the S3 bucket and the file is overwritten?
It appears that you wish to implement de-duplication so that the same file is not stored multiple times.
Amazon S3 does not provide de-duplication as a feature. Your software would need to recognize such duplication and implement this capability itself.
Alternatively, you might want to use commercial backup software that already has this capability built in.
Example: MSP360 Backup (formerly CloudBerry Backup)
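As a rough illustration of doing this in your own software, the sketch below combines two ideas: a key that leaves out the date, so the next backup targets the same key and replaces the earlier object, and a content hash to skip uploads when nothing changed. Everything here is an assumption for illustration: BackupDedup, stableKey and backUp are hypothetical names, dropping the date from the key pattern may not suit every backup scheme, and the in-memory map stands in for the bucket.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

final class BackupDedup {
    // Hypothetical stand-in for the bucket: key -> hash of the content last stored under it.
    private final Map<String, String> storedHashes = new HashMap<>();

    // Key without the date part, so tomorrow's backup targets the same key as today's.
    static String stableKey(String fileType, String fileName) {
        return "tmp/" + fileType + "." + fileName;
    }

    // Hex-encoded SHA-256 of the content, used to detect unchanged files.
    static String sha256Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // Returns true if the content was (re)written, false if an identical copy is already stored.
    boolean backUp(String key, byte[] content) {
        String hash = sha256Hex(content);
        if (hash.equals(storedHashes.get(key))) {
            return false; // unchanged: skip the upload
        }
        storedHashes.put(key, hash); // a real implementation would upload to this key here
        return true;
    }

    public static void main(String[] args) {
        BackupDedup backup = new BackupDedup();
        String key = stableKey("txt", "abc.txt");
        byte[] v1 = "first version".getBytes(StandardCharsets.UTF_8);
        byte[] v2 = "second version".getBytes(StandardCharsets.UTF_8);
        System.out.println(backup.backUp(key, v1)); // true: nothing stored yet
        System.out.println(backup.backUp(key, v1)); // false: same content
        System.out.println(backup.backUp(key, v2)); // true: content changed
    }
}
```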

download all files from S3 and upload them in same folder

I enable versioning on every bucket when it is created. For backward compatibility (buckets that already exist), I'm downloading all files/keys and uploading them again. I'm doing this:
fullObject = s3Client.getObject(new GetObjectRequest(bucketName, key));
But I am not able to figure out how to upload each file to its specific folder. Or is there another way to handle backward compatibility?
S3 doesn't really have folders; it "fakes" them with logic that looks for common prefixes separated by "/". The "key" is the full path name of the file within the bucket.
It sounds like you are doing a get and then want to do a put of the same bytes to the same key in the bucket. If that is what you want to do, then just use the same key that you used for getting the object.
In a bucket that has versioning turned on, this will result in two copies of the file. It is not clear why you would want to do that (assuming that you are writing back exactly what you are reading). Wouldn't you just end up paying for two copies of the same file?
My understanding is that if you turn on versioning for a bucket that already has files in it, everything works the way you would expect. If a pre-existing file gets deleted, S3 just adds a delete marker. If a pre-existing file gets overwritten, S3 keeps the prior version (as a prior version of the new file). There is no need to proactively rewrite existing files when you turn on versioning. If you are concerned, you can easily test this through the S3 Console, the command line interface, or one of the language-specific APIs.
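To make the point about keys and "folders" concrete, here is a small sketch that splits a key at its last "/". The class S3Keys, its methods, and the example key are hypothetical and use plain string handling only. It shows that the "folder" is just part of the key, so writing an object back to the same place means reusing the full key unchanged.

```java
final class S3Keys {
    // Returns the "folder" part of a key: everything up to and including the last "/".
    // A key without "/" sits at the top level of the bucket, so its prefix is empty.
    static String prefixOf(String key) {
        int slash = key.lastIndexOf('/');
        return slash < 0 ? "" : key.substring(0, slash + 1);
    }

    // Returns the part after the last "/", which a console would display as the file name.
    static String nameOf(String key) {
        return key.substring(key.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        String key = "reports/2020/summary.csv"; // hypothetical key
        System.out.println(prefixOf(key)); // prints reports/2020/
        System.out.println(nameOf(key));   // prints summary.csv
        // Prefix plus name is the original key, which is all an upload needs.
        System.out.println((prefixOf(key) + nameOf(key)).equals(key)); // prints true
    }
}
```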

How to sync directory with AWS S3 using Java SDK?

I came across this article https://aws.amazon.com/blogs/developer/syncing-data-with-amazon-s3/ which made me aware of the uploadDirectory() method. The blog states: "This small bit of code compares the contents of the local directory to the contents in the Amazon S3 bucket and only transfer files that have changed." This does not seem to be entirely correct since it appears to always transfer every file in a given directory as opposed to only the files that have changed.
I was able to do what I wanted using the AWS CLI's s3 sync command. However, the goal is to do this syncing using the Java SDK. Is it possible to do this same type of sync using the Java SDK?
There is no SDK implementation of the s3 sync command. You will have to implement it in Java if needed. According to the CLI doc https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/sync.html:
An s3 object will require downloading if one of the following conditions is true:
The s3 object does not exist in the local directory.
The size of the s3 object differs from the size of the local file.
The last modified time of the s3 object is older than the last modified time of the local file.
So you essentially need to compare the objects in the target bucket with your local files using the rules above.
Also note that these checks do not cover --delete, so if you need it you will have to implement the logic for deleting remote objects whose local file no longer exists.
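The comparison described above can be sketched in plain Java as follows. This is only an illustration under assumptions: FileMeta and SyncRules are hypothetical names, the metadata would in practice come from a bucket listing and from the local file system, and the time rule is implemented exactly as quoted for the download direction, so an upload-oriented sync would need to adapt it. The second method covers the --delete case by computing remote keys that have no local counterpart.

```java
import java.time.Instant;
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical holder for the two attributes the rules compare.
final class FileMeta {
    final long size;
    final Instant lastModified;

    FileMeta(long size, Instant lastModified) {
        this.size = size;
        this.lastModified = lastModified;
    }
}

final class SyncRules {
    // Applies the three quoted conditions; 'local' is null when no local file exists.
    static boolean needsDownload(FileMeta s3Object, FileMeta local) {
        if (local == null) {
            return true; // does not exist locally
        }
        if (s3Object.size != local.size) {
            return true; // sizes differ
        }
        // s3 object is older than the local file
        return s3Object.lastModified.isBefore(local.lastModified);
    }

    // For --delete-style behaviour: remote keys with no matching local file.
    static Set<String> remoteKeysToDelete(Set<String> remoteKeys, Set<String> localKeys) {
        Set<String> extra = new TreeSet<>(remoteKeys);
        extra.removeAll(localKeys);
        return extra;
    }

    public static void main(String[] args) {
        // Made-up metadata for illustration only.
        FileMeta remote = new FileMeta(10, Instant.parse("2020-01-01T00:00:00Z"));
        FileMeta sameLocal = new FileMeta(10, Instant.parse("2020-01-01T00:00:00Z"));
        System.out.println(needsDownload(remote, null));      // true: missing locally
        System.out.println(needsDownload(remote, sameLocal)); // false: same size and time
        Set<String> remoteKeys = new TreeSet<>(Arrays.asList("a.txt", "b.txt"));
        Set<String> localKeys = new TreeSet<>(Arrays.asList("a.txt"));
        System.out.println(remoteKeysToDelete(remoteKeys, localKeys)); // [b.txt]
    }
}
```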
I've found it, it is TransferManager.uploadDirectory()
TransferManager.copy() might do something similar, but I do not know what behaviour is employed in case a file or directory with the same name and modification time exists on the destination server.

Amazon S3 Java SDK: Recursive copy from S3 to S3 [duplicate]
