I am developing a feature that backs up our files to an S3 bucket, using the key pattern "tmp/yyyy-mm-dd.file_type.fileName".
If I run my app today to back up the file "abc.txt", it is stored under a key that follows this pattern.
Let's say "abc.txt" is updated tomorrow and the updated file needs to be backed up to S3. It will be pushed with a different date in the key, but with the same fileName.
What should I do so that the bucket holds no redundant copy and the old file is overwritten?
It appears that you wish to implement de-duplication so that the same file is not stored multiple times.
Amazon S3 does not provide de-duplication as a feature. Your software would need to recognize such duplication and implement this capability itself.
Alternatively, you might want to use commercial backup software that already has this capability built in.
Example: MSP360 Backup (formerly CloudBerry Backup)
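As a rough illustration of what "recognize such duplication" in your own software could look like, here is a minimal sketch that finds older backups of the same file so they can be deleted after the new upload. It assumes the key pattern from the question and treats everything after the date as the file's identity. The class and method names (BackupKeys, identityOf, staleKeys) are made up for this example, and the actual listing and deleting of objects through the SDK is left out. A simpler route, if the date is not needed in the key, is to drop it: a PUT to an existing key replaces the object.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: works on plain key strings, no AWS calls.
final class BackupKeys {
    private static final String PREFIX = "tmp/";
    private static final int DATE_LEN = "yyyy-mm-dd".length();

    // Returns the part of the key after the date (file_type.fileName),
    // or null if the key does not follow the assumed pattern.
    static String identityOf(String key) {
        if (!key.startsWith(PREFIX)) {
            return null;
        }
        String rest = key.substring(PREFIX.length());
        if (rest.length() <= DATE_LEN + 1 || rest.charAt(DATE_LEN) != '.') {
            return null;
        }
        return rest.substring(DATE_LEN + 1);
    }

    // Returns the existing keys that hold an earlier backup of the same file.
    static List<String> staleKeys(List<String> existingKeys, String newKey) {
        List<String> stale = new ArrayList<>();
        String identity = identityOf(newKey);
        if (identity == null) {
            return stale;
        }
        for (String key : existingKeys) {
            if (!key.equals(newKey) && identity.equals(identityOf(key))) {
                stale.add(key);
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        List<String> existing = List.of(
                "tmp/2024-05-01.txt.abc.txt",
                "tmp/2024-05-01.txt.other.txt");
        // prints [tmp/2024-05-01.txt.abc.txt]
        System.out.println(staleKeys(existing, "tmp/2024-05-02.txt.abc.txt"));
    }
}
```

The keys returned by staleKeys would then be passed to whatever delete call your SDK version provides.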
I'm implementing an app which offers file storage in Amazon S3, while the user operates with normal files & folders concept.
In the backend I use the AWS Java SDK Version 2.
Is there any way to rename/move an entire folder (thus recursively modifying its content)?
Or do I have to implement the recursion manually and invoke copy + delete for each resource?
Thanks.
I don't think you can rename the directory; the full path is actually the object's key.
In Amazon S3 you don't really have folders, just keys.
So if you want to "rename a folder", unfortunately you have to move all of its objects to new keys.
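To make "move all your objects to another key" concrete, here is a minimal sketch of the planning step only: given the keys listed under the old prefix, it computes the new key for each one. The class and method names (PrefixRename, plan) are invented for this example. With the AWS SDK for Java 2.x you would typically get the keys from listObjectsV2, then call copyObject and deleteObject for each pair; those calls are deliberately not shown here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: maps old keys to new keys, no AWS calls.
final class PrefixRename {

    // Pass prefixes with a trailing "/" so that renaming "docs/" does not
    // also match keys under "docs2/".
    static Map<String, String> plan(List<String> keys, String oldPrefix, String newPrefix) {
        Map<String, String> moves = new LinkedHashMap<>();
        for (String key : keys) {
            if (key.startsWith(oldPrefix)) {
                moves.put(key, newPrefix + key.substring(oldPrefix.length()));
            }
        }
        return moves;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("docs/a.txt", "docs/sub/b.txt", "docs2/c.txt");
        // prints {docs/a.txt=archive/a.txt, docs/sub/b.txt=archive/sub/b.txt}
        System.out.println(plan(keys, "docs/", "archive/"));
    }
}
```

Each entry is one copy followed by one delete, so a large "folder" means many requests; there is no single rename operation.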
I enable versioning on every bucket when it is created. But for backward compatibility (buckets that were created earlier), I'm downloading all files/keys and uploading them again. I'm doing this:
fullObject = s3Client.getObject(new GetObjectRequest(bucketName, key));
But I am not able to figure out how to upload each file back into its specific folder. Or is there another way to handle backward compatibility?
S3 doesn't really have folders, it "fakes" it with logic that looks for common prefixes separated by "/". The "key" is the full path name of the file within the bucket.
It sounds like you are doing a get and then want to do a put of the same bytes to the same key in the bucket. If that is what you want to do, then just use the same key that you used for getting the object.
In a bucket that has versioning turned on, this will result in two copies of the file. It is not clear why you would want to do that (assuming that you are writing back exactly what you are reading). Wouldn't you just end up paying for two copies of the same file?
My understanding is that if you turn on versioning for a bucket that already has files in it, everything works the way you would expect. If a pre-existing file gets deleted, S3 just adds a delete marker. If a pre-existing file gets overwritten, S3 keeps the prior version (as a prior version of the new file). There is no need to proactively rewrite existing files when you turn on versioning. If you are concerned, you can easily test this through the S3 Console, the command line interface, or one of the language-specific APIs.
I came across this article https://aws.amazon.com/blogs/developer/syncing-data-with-amazon-s3/ which made me aware of the uploadDirectory() method. The blog states: "This small bit of code compares the contents of the local directory to the contents in the Amazon S3 bucket and only transfer files that have changed." This does not seem to be entirely correct since it appears to always transfer every file in a given directory as opposed to only the files that have changed.
I was able to do what I wanted using the AWS CLI's s3 sync command, but the goal is to do this syncing with the Java SDK. Is it possible to do the same type of sync using the Java SDK?
There is no SDK implementation of the s3 sync command. You will have to implement it in Java if needed. According to the CLI documentation (https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/sync.html):
An s3 object will require downloading if one of the following conditions is true:
The s3 object does not exist in the local directory.
The size of the s3 object differs from the size of the local file.
The last modified time of the s3 object is older than the last modified time of the local file.
Therefore, you essentially need to compare the objects in the target bucket with your local files based on the rules above.
Also note that these checks do not cover --delete, so if you need that behaviour you will have to implement the logic for deleting remote objects whose local file no longer exists.
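The comparison could be sketched roughly as follows for the upload direction. This is an assumption on my part: it mirrors the quoted rules so that a local file is uploaded when the object is missing, when the sizes differ, or when the local file is newer than the object. The names (SyncDecision, needsUpload) are invented for this example, and fetching the remote size and last-modified time (for instance from a bucket listing) is left out.

```java
import java.time.Instant;

// Hypothetical helper: pure comparison logic, no AWS calls.
final class SyncDecision {

    // remoteSize and remoteModified are null when no object exists for the key.
    static boolean needsUpload(long localSize, Instant localModified,
                               Long remoteSize, Instant remoteModified) {
        if (remoteSize == null || remoteModified == null) {
            return true; // object does not exist in the bucket
        }
        if (localSize != remoteSize) {
            return true; // sizes differ
        }
        // same size: upload only if the local file is newer
        return localModified.isAfter(remoteModified);
    }

    public static void main(String[] args) {
        Instant older = Instant.parse("2024-01-01T00:00:00Z");
        Instant newer = Instant.parse("2024-01-02T00:00:00Z");
        System.out.println(needsUpload(10, older, null, null)); // prints true
        System.out.println(needsUpload(10, older, 10L, newer)); // prints false
        System.out.println(needsUpload(10, newer, 10L, older)); // prints true
    }
}
```

Size and timestamp are only a heuristic: a file edited without changing its size and with an older timestamp would be skipped, so compare checksums if that matters to you.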
I've found it: it is TransferManager.uploadDirectory()
TransferManager.copy() might do something similar, but I do not know what behaviour is employed in case a file or directory with the same name and modification time exists on the destination server.
Is there any way to get only the newly added files from Amazon S3 (for example, by sorting on timestamp)? My current approach is to go through each file and compare its creation timestamp, which is very slow when I have lots of files in the bucket.