I came across this article https://aws.amazon.com/blogs/developer/syncing-data-with-amazon-s3/ which made me aware of the uploadDirectory() method. The blog states: "This small bit of code compares the contents of the local directory to the contents in the Amazon S3 bucket and only transfer files that have changed." This does not seem to be entirely correct since it appears to always transfer every file in a given directory as opposed to only the files that have changed.
I was able to do what I wanted using the AWS CLI's s3 sync command, but the goal is to do this syncing with the Java SDK. Is it possible to do the same type of sync using the Java SDK?
There is no SDK implementation of the s3 sync command. You will have to implement it in Java if needed. According to the CLI doc https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/sync.html:
An s3 object will require downloading if one of the following conditions is true:
The s3 object does not exist in the local directory.
The size of the s3 object differs from the size of the local file.
The last modified time of the s3 object is older than the last modified time of the local file.
So, essentially, you will need to compare the objects in the target bucket with your local files based on the rules above.
Also note that these checks do not cover --delete, so if you need that behaviour you will also have to implement the logic that deletes remote objects whose local file no longer exists.
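The comparison described above can be sketched in plain Java. This is only a sketch under assumptions: the class SyncPlanner, its FileInfo type, and the method names are hypothetical, the rules are applied in the upload direction (the mirror image of the download rules quoted above), and collecting the two maps (local files, and objects listed from the bucket) is left to the caller.

```java
import java.time.Instant;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: decides what an upload-direction sync has to do. */
final class SyncPlanner {

    /** Size and last-modified time of a local file or of a remote object. */
    static final class FileInfo {
        final long size;
        final Instant lastModified;

        FileInfo(long size, Instant lastModified) {
            this.size = size;
            this.lastModified = lastModified;
        }
    }

    /** True if the key is missing remotely, the sizes differ, or the local file is newer. */
    static boolean needsUpload(FileInfo local, FileInfo remote) {
        if (remote == null) {
            return true; // no object with this key in the bucket
        }
        if (local.size != remote.size) {
            return true;
        }
        return local.lastModified.isAfter(remote.lastModified);
    }

    /** Every local key that fails the checks in needsUpload. */
    static Set<String> keysToUpload(Map<String, FileInfo> local, Map<String, FileInfo> remote) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, FileInfo> e : local.entrySet()) {
            if (needsUpload(e.getValue(), remote.get(e.getKey()))) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    /** For --delete style behaviour: remote keys that have no local counterpart. */
    static Set<String> keysToDelete(Map<String, FileInfo> local, Map<String, FileInfo> remote) {
        Set<String> result = new TreeSet<>(remote.keySet());
        result.removeAll(local.keySet());
        return result;
    }
}
```

The two resulting sets would then drive the actual upload and delete calls of whichever SDK version is in use.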
I've found it: it is TransferManager.uploadDirectory()
TransferManager.copy() might do something similar, but I do not know what behaviour is employed in case a file or directory with the same name and modification time exists on the destination server.
I'm implementing an app which offers file storage in Amazon S3, while the user operates with normal files & folders concept.
In the backend I use the AWS Java SDK Version 2.
Is there any way to rename/move an entire folder (thus recursively modifying its contents)?
Or do I have to implement the recursion manually and invoke copy+delete for each resource?
Thanks.
I don't think you can rename the directory; the full path is actually the object's key.
In Amazon S3 you don't really have folders, only keys.
So if you want to "rename a folder", unfortunately you have to move all of its objects to new keys.
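As a rough illustration of what "moving all objects to another key" means, the sketch below computes the new key for every object under a folder-like prefix. PrefixRenamer and planRename are hypothetical names, and the list of keys is assumed to come from a listing of the bucket; no SDK call is made here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: plans a "folder rename" as a set of key rewrites. */
final class PrefixRenamer {

    /**
     * Maps each key that starts with oldPrefix to the key it would have under
     * newPrefix. Both prefixes should end with "/" so that "photos/" does not
     * also match "photos-old/...". Keys outside oldPrefix are ignored.
     */
    static Map<String, String> planRename(List<String> keys, String oldPrefix, String newPrefix) {
        Map<String, String> plan = new LinkedHashMap<>();
        for (String key : keys) {
            if (key.startsWith(oldPrefix)) {
                plan.put(key, newPrefix + key.substring(oldPrefix.length()));
            }
        }
        return plan;
    }
}
```

Each entry of the plan then corresponds to one copy to the new key followed by one delete of the old key, which is the copy+delete loop the question mentions.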
I have an S3 bucket that has a directory structure with input and output folders. The files are stored in the input directory by a tool that uses the Java API to communicate with S3, but to move files from the input to the output directory we need some other approach (even if it is a Java class we implement ourselves).
What I need to know is whether it is possible to bulk move files, given a list of files to move, without calling the mv command of the S3 CLI file by file, which seems to be very slow. I found some information about bulk delete on S3 with Java, but I would like a more expert opinion, if there is one.
I'm open to suggestions on other languages that have an API that serves my purpose.
PS: the list of files to move is critical, because the criteria for selecting those files cannot be expressed with the usual include/exclude options available in the S3 CLI.
Thanks in advance
*********** EDIT *********
I just found the boto3 API, which was very simple to set up. I'm adding this info and the tags to the question in order to get more insights on this. Thanks
There is no "move" command in Amazon S3. Instead, the objects would need to be copied, and then the source file deleted. This is what the AWS CLI actually does when doing aws s3 mv.
The great thing about the AWS CLI is that it issues copy commands in parallel, which greatly reduces the time to move a large number of objects. The fact is that the Amazon S3 CopyObject API call only accepts one object at a time. Hence, the need to issue such commands in parallel to move them faster.
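The same idea of issuing the per-object operations in parallel can be sketched in Java with a thread pool. This is a minimal sketch, not what the AWS CLI does internally: ParallelMover and moveAll are hypothetical names, and moveOne stands in for whatever "copy the object, then delete the source" call the application uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

/** Hypothetical helper: runs one move operation per key on a fixed-size pool. */
final class ParallelMover {

    /**
     * Submits moveOne for every key and waits until all of them have finished.
     * moveOne is a placeholder for "copy the object, then delete the source".
     */
    static void moveAll(List<String> keys, Consumer<String> moveOne, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (String key : keys) {
                pending.add(pool.submit(() -> moveOne.accept(key)));
            }
            for (Future<?> f : pending) {
                f.get(); // surfaces a failed move as ExecutionException
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the list of keys is passed in explicitly, the selection criteria can be arbitrary and do not have to fit include/exclude patterns.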
An alternative is to use S3 Batch Operations. You can use Put object copy:
The Put object copy operation copies each object specified in the manifest. You can copy objects to a different bucket in the same AWS Region or to a bucket in a different Region. S3 Batch Operations supports most options available through Amazon S3 for copying objects. These options include setting object metadata, setting permissions, and changing an object's storage class.
The list of objects to copy can be specified in a CSV file. You would then need to delete the source objects after the copy, which can be done with the AWS CLI delete-objects command and a list of objects.
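A single DeleteObjects request accepts at most 1000 keys, so a long list has to be split before the delete step. The sketch below only does the splitting; KeyBatcher and batches are hypothetical names, and sending each batch is left to the CLI or SDK call of your choice.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a key list into request-sized batches. */
final class KeyBatcher {

    /** Maximum number of keys accepted by one DeleteObjects request. */
    static final int MAX_KEYS_PER_DELETE = 1000;

    /** Splits keys into consecutive batches of at most batchSize elements. */
    static List<List<String>> batches(List<String> keys, int batchSize) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < keys.size(); i += batchSize) {
            int end = Math.min(i + batchSize, keys.size());
            out.add(new ArrayList<>(keys.subList(i, end)));
        }
        return out;
    }
}
```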
If you need to call the AWS CLI on several files in parallel, you can use GNU parallel on Linux:
find . -name '*.jpg' | parallel aws s3 mv s3://bucketA/{} s3://bucketB/
You'll need to install it though. For example:
sudo apt install parallel
or
sudo yum install parallel
I have enabled versioning on the bucket whenever a new bucket is created. But for backward compatibility (buckets that were already created), I'm downloading all files/keys and uploading them again. I'm doing this:
fullObject = s3Client.getObject(new GetObjectRequest(bucketName, key));
But I cannot figure out how to upload each file into its specific folder. Or is there another way to handle backward compatibility?
S3 doesn't really have folders, it "fakes" it with logic that looks for common prefixes separated by "/". The "key" is the full path name of the file within the bucket.
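To make the "common prefixes" idea concrete, here is a small sketch that derives the apparent subfolders from a flat list of keys, similar in spirit to what a listing with a "/" delimiter returns. FolderView and commonPrefixes are hypothetical names and the code does not talk to S3.

```java
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical helper: shows how "folders" fall out of plain keys. */
final class FolderView {

    /**
     * Returns the immediate "subfolders" of prefix: for every key under the
     * prefix that contains another "/", the part up to and including that "/".
     */
    static SortedSet<String> commonPrefixes(List<String> keys, String prefix) {
        SortedSet<String> result = new TreeSet<>();
        for (String key : keys) {
            if (!key.startsWith(prefix)) {
                continue;
            }
            int slash = key.indexOf('/', prefix.length());
            if (slash >= 0) {
                result.add(key.substring(0, slash + 1));
            }
        }
        return result;
    }
}
```

With the keys input/a.txt, input/2024/b.txt and output/c.txt, an empty prefix yields input/ and output/, and the prefix input/ yields input/2024/. Nothing other than the keys themselves is stored.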
It sounds like you are doing a get and then want to do a put of the same bytes to the same key in the bucket. If that is what you want to do, then just use the same key that you used for getting the object.
In a bucket that has versioning turned on, this will result in two copies of the file. It is not clear why you would want to do that (assuming that you are writing back exactly what you are reading). Wouldn't you just end up paying for two copies of the same file?
My understanding is that if you turn on versioning for a bucket that already has files in it, that everything works the way you would expect. If a pre-existing file gets deleted, it just adds a delete marker. If a pre-existing file gets overwritten, it keeps the prior version (as a prior version of the new file). There is no need to pro-actively rewrite existing files when you turn on versioning. If you are concerned, you can easily test this through either the S3 Console, the command line interface, or through one of the language-specific APIs.
For local development with appengine, I need to change where uploaded images are stored with the GCS service so that they are persisted across builds. Right now, a new build wipes out the target directory along with the images in the appengine-generated directory.
I had the same problem with the datastore but was able to fix this by setting a property to use a datastore located in my repo outside of the target directory.
-Ddatastore.backing_store=../../local_db.bin
Is there a comparable property for the images/files saved using the GCS service?
With the Python local server, --storage_path=... determines where everything is stored ("Datastore, Blobstore files, Google Cloud Storage Files, logs, etc", to quote the docs) unless explicitly overridden. It doesn't appear that the possible values listed for Java at https://cloud.google.com/appengine/docs/java/tools/localunittesting/javadoc/constant-values encompass a similarly all-inclusive path, however.
As @alex pointed out, there is a parameter that defines where all local files are stored for Python, and an equivalent exists for Java too.
For Java the parameter is --generated_dir=<path>, which is a server parameter, not a JVM option.
Also note that this overrides the use of -Ddatastore.backing_store=<local_db.bin>.
There is documentation on this feature here: https://cloud.google.com/appengine/docs/java/tools/devserver?hl=en