I have an S3 bucket with a directory structure containing input and output folders. Files are stored in the input directory by a tool that uses the Java API to communicate with S3, but moving the files from the input to the output directory has to be done by other means (even a Java class implemented by ourselves).
What I need to know is whether it is possible to bulk-move files given a list of files to move, without having to call the mv command of the AWS S3 CLI file by file, which seems to be very slow. I checked some information regarding bulk delete on S3 with Java, but I would appreciate a more expert opinion, if there is one.
I'm open to suggestions on languages to use that have an API serving my purpose.
PS: the point about the list of files to move is critical, because the criteria used to select those files cannot be implemented with the usual include/exclude filters available in the AWS S3 CLI.
Thanks in advance
*********** EDIT *********
Just found the boto3 API, which was very simple to set up. I'm adding this info and the corresponding tags to the question in order to get more insights. Thanks.
There is no "move" command in Amazon S3. Instead, the objects would need to be copied, and then the source file deleted. This is what the AWS CLI actually does when doing aws s3 mv.
The great thing about the AWS CLI is that it issues the copy requests in parallel, which greatly reduces the time needed to move a large number of objects. The Amazon S3 CopyObject API call only accepts one object at a time, hence the need to issue such calls in parallel to move objects faster.
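A copy-then-delete move can be sketched with the AWS SDK for Java v2, parallelized with an ExecutorService. This is only a sketch: the bucket name, keys, thread count, and the input/→output/ key mapping below are placeholder assumptions, and error handling is omitted:

```java
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.CopyObjectRequest;
import software.amazon.awssdk.services.s3.model.DeleteObjectRequest;

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BulkMove {
    public static void main(String[] args) throws InterruptedException {
        S3Client s3 = S3Client.create();
        String bucket = "my-bucket";                               // placeholder
        List<String> keys = List.of("input/a.csv", "input/b.csv"); // your list of files

        ExecutorService pool = Executors.newFixedThreadPool(16);
        for (String key : keys) {
            pool.submit(() -> {
                String target = key.replaceFirst("^input/", "output/");
                // S3 has no move: copy the object to its new key...
                s3.copyObject(CopyObjectRequest.builder()
                        .sourceBucket(bucket).sourceKey(key)
                        .destinationBucket(bucket).destinationKey(target)
                        .build());
                // ...then delete the source object
                s3.deleteObject(DeleteObjectRequest.builder()
                        .bucket(bucket).key(key)
                        .build());
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```

This mirrors what aws s3 mv does internally: parallel CopyObject calls followed by deletes.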
An alternative is to use S3 Batch Operations. You can use Put object copy:
The Put object copy operation copies each object specified in the manifest. You can copy objects to a different bucket in the same AWS Region or to a bucket in a different Region. S3 Batch Operations supports most options available through Amazon S3 for copying objects. These options include setting object metadata, setting permissions, and changing an object's storage class.
The list of objects to copy can be specified in a CSV manifest file. You would then need to delete the objects after the copy, which can be done via the aws s3api delete-objects command and a list of objects.
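For reference, a Batch Operations CSV manifest is simply one object per line, bucket name followed by key (the bucket and keys below are made-up examples):

```
examplebucket,input/file1.csv
examplebucket,input/file2.csv
```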
If you need to run the AWS CLI on several files in parallel, you can use GNU parallel on Linux:
find . -name '*.jpg' | parallel aws s3 mv s3://bucketA/{} s3://bucketB/
You'll need to install it though. For example:
sudo apt install parallel
or
sudo yum install parallel
Related
I'm implementing an app that offers file storage on Amazon S3, while the user works with the usual files-and-folders concept.
In the backend I use the AWS Java SDK Version 2.
Is there any way to rename/move an entire folder (thus recursively modifying its contents)?
Or do I have to manually implement a recursive listing and invoke copy+delete for each resource?
Thanks.
I don't think you can rename the directory; the full path is actually the key of an object in the object store.
In Amazon S3 you don't really have folders; a "folder" is just a key prefix.
So if you want to "rename a folder", unfortunately you have to copy all of its objects to new keys (and delete the old ones).
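Under that model, a "folder rename" can be sketched with the AWS SDK for Java v2 as list-by-prefix, copy, delete. The bucket and prefixes below are placeholders, and error handling is omitted; pagination of the listing is handled by listObjectsV2Paginator:

```java
import software.amazon.awssdk.services.s3.S3Client;

public class RenameFolder {
    public static void main(String[] args) {
        S3Client s3 = S3Client.create();
        String bucket = "my-bucket";        // placeholder
        String oldPrefix = "docs/old/";     // placeholder
        String newPrefix = "docs/new/";     // placeholder

        // List every object under the old prefix (pages are fetched automatically)
        s3.listObjectsV2Paginator(b -> b.bucket(bucket).prefix(oldPrefix))
          .contents()
          .forEach(obj -> {
              String newKey = newPrefix + obj.key().substring(oldPrefix.length());
              // Copy to the new key, then delete the original
              s3.copyObject(c -> c.sourceBucket(bucket).sourceKey(obj.key())
                                  .destinationBucket(bucket).destinationKey(newKey));
              s3.deleteObject(d -> d.bucket(bucket).key(obj.key()));
          });
    }
}
```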
I want to write a Java program that zips an AWS object (a file or a directory) at a given location (an S3 bucket prefix) to another given location. I have done the same task, not for S3 objects, but for local disk files and directories. Is there any direct way (using a class or interface) to do so?
I also read that it can be done by:
1. downloading the object locally
2. zipping the downloaded file
3. uploading the zip to the desired location
Is this the practical way to do it? Does anyone have a better idea, or know the classes and interfaces that can be used for the above steps? I appreciate your help!
For steps (1) and (3) in your list, this has examples of how to download and upload objects to S3: https://docs.aws.amazon.com/sdk-for-java/v2/developer-guide/examples-s3-objects.html.
The classes and interfaces mentioned in the examples are documented here: https://sdk.amazonaws.com/java/api/latest/. Scroll down in the top left frame to the S3 packages starting with software.amazon.awssdk.services.s3.
For step (2) (zipping the file once downloaded), you could use the code you used for zipping the local files and folders.
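Putting the three steps together, a minimal sketch with the AWS SDK for Java v2 and java.util.zip could look like this. The bucket and keys are placeholders, and the object is staged through temporary files on local disk:

```java
import software.amazon.awssdk.services.s3.S3Client;

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipS3Object {
    public static void main(String[] args) throws Exception {
        S3Client s3 = S3Client.create();
        String bucket = "my-bucket";          // placeholder
        String sourceKey = "input/data.csv";  // placeholder
        String targetKey = "output/data.zip"; // placeholder

        // 1. download the object to a local temp file
        Path plain = Files.createTempFile("s3obj", null);
        Files.delete(plain); // getObject requires the target file not to exist
        s3.getObject(b -> b.bucket(bucket).key(sourceKey), plain);

        // 2. zip the downloaded file
        Path zipped = Files.createTempFile("s3obj", ".zip");
        try (ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(zipped));
             InputStream in = Files.newInputStream(plain)) {
            zos.putNextEntry(new ZipEntry(sourceKey));
            in.transferTo(zos);
            zos.closeEntry();
        }

        // 3. upload the zip to the desired location
        s3.putObject(b -> b.bucket(bucket).key(targetKey), zipped);
    }
}
```

Step (2) is plain JDK code, so you can reuse whatever zipping logic you already wrote for local files.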
A similar question has already been answered here:
Is it possible to compress files which are already in AWS S3?
If you are looking for the AWS SDK for Java API for S3:
https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/AmazonS3Client.html
I came across this article https://aws.amazon.com/blogs/developer/syncing-data-with-amazon-s3/ which made me aware of the uploadDirectory() method. The blog states: "This small bit of code compares the contents of the local directory to the contents in the Amazon S3 bucket and only transfer files that have changed." This does not seem to be entirely correct since it appears to always transfer every file in a given directory as opposed to only the files that have changed.
I was able to do what I wanted using the AWS CLI's s3 sync command; however, the goal is to do this syncing with the Java SDK. Is the same type of sync possible using the Java SDK?
There is no SDK implementation of the s3 sync command; you will have to implement it in Java if needed. According to the CLI documentation, https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/sync.html:
An s3 object will require downloading if one of the following conditions is true:
- The s3 object does not exist in the local directory.
- The size of the s3 object differs from the size of the local file.
- The last modified time of the s3 object is newer than the last modified time of the local file.
Therefore, essentially you will need to compare the objects in the target bucket with your local files based on the above rules.
Also note that the above check does not handle --delete, so you might need to implement the logic for deleting remote objects when the corresponding local file does not exist, if that behaviour is needed.
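The comparison rules can be captured in a small pure helper like this. It is only a sketch with made-up names; in practice the sizes and timestamps would come from ListObjectsV2 responses and java.nio.file attributes:

```java
import java.time.Instant;

class SyncRule {
    /**
     * Returns true when the S3 object should be transferred, following the
     * CLI sync rules: the object is missing locally, the sizes differ, or
     * the S3 object was modified more recently than the local file.
     */
    static boolean needsDownload(boolean existsLocally,
                                 long s3Size, long localSize,
                                 Instant s3Modified, Instant localModified) {
        if (!existsLocally) {
            return true;              // rule 1: no local copy at all
        }
        if (s3Size != localSize) {
            return true;              // rule 2: sizes differ
        }
        return s3Modified.isAfter(localModified); // rule 3: remote is newer
    }
}
```

You would run this decision for every remote object and download only the ones for which it returns true.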
I've found it: it is TransferManager.uploadDirectory().
TransferManager.copy() might do something similar, but I do not know what behaviour it employs when a file or directory with the same name and modification time already exists at the destination.
I am planning to write a Java program that synchronizes new FTP data with Amazon S3 at a specific time interval, skipping the files/folders that have already been copied/uploaded to S3. Is this the best possible way, with good performance, to achieve this functionality, or should I look for another approach? If a Java program is good enough for the design, I would like to know the best strategies to apply to get the best performance out of it.
s3fs is a good option if you are using FTP.
You can install FUSE on your machine and mount S3 at a mount point to access it directly. Make the FTP home directory your mount point, so that whenever any data is uploaded to the home directory it goes straight into S3.
NOTE: deletions are mirrored too; when something is deleted from the home directory, it is removed from the S3 bucket as well.