I am planning to write a Java program that synchronizes FTP data not yet present in Amazon S3, running at a specific time interval. The idea is to skip the files/folders that have already been copied/uploaded to S3. Is this the best way, performance-wise, to achieve this functionality, or should I find some other approach? If a Java program is a good fit for this, I would like to know the best strategies to apply to get the best performance out of it.
S3fs is the best option if you are using FTP.
You can install FUSE on your machine and mount a point that accesses S3 directly. Make the FTP home directory your mount point, so that whenever any data is uploaded to the home directory it goes straight into S3.
NOTE: Deletions propagate as well; when something is deleted from the mounted home directory, it is also deleted from the S3 bucket.
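If you stay with the plain Java program from the question, the usual way to skip what is already uploaded is to list the bucket once and compare keys, rather than issuing a HEAD request per file. Below is a minimal sketch using the AWS SDK for Java v2; the bucket name, local root, and key layout are illustrative assumptions, and a stricter check could also compare size or ETag before skipping.

    import software.amazon.awssdk.core.sync.RequestBody;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
    import software.amazon.awssdk.services.s3.model.PutObjectRequest;

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.stream.Stream;

    public class IncrementalS3Sync {

        // Upload every file under localRoot that is not already present in the bucket.
        static void syncMissingFiles(S3Client s3, String bucket, Path localRoot) throws IOException {
            // 1. Collect the keys that already exist (one LIST pass instead of a HEAD per file).
            Set<String> existingKeys = new HashSet<>();
            s3.listObjectsV2Paginator(ListObjectsV2Request.builder().bucket(bucket).build())
              .contents()
              .forEach(obj -> existingKeys.add(obj.key()));

            // 2. Walk the local (FTP) tree and upload only what is missing.
            try (Stream<Path> files = Files.walk(localRoot)) {
                files.filter(Files::isRegularFile).forEach(file -> {
                    String key = localRoot.relativize(file).toString().replace('\\', '/');
                    if (existingKeys.contains(key)) {
                        return; // already uploaded, skip
                    }
                    s3.putObject(PutObjectRequest.builder().bucket(bucket).key(key).build(),
                                 RequestBody.fromFile(file));
                });
            }
        }
    }

Running this at a fixed interval can then be handled by a ScheduledExecutorService or an external scheduler such as cron.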
Related
I'm implementing an app that offers file storage in Amazon S3, while the user works with the usual files-and-folders concept.
In the backend I use the AWS Java SDK Version 2.
Do I have any way to rename/move an entire folder (thus recursively modifying its contents)?
Or do I have to implement a recursive traversal myself and invoke copy+delete for each resource?
Thanks.
I don't think you can rename a directory; the full path is actually the object's key in the store.
In Amazon S3 you don't really have folders; a "folder" is just a key prefix.
So if you want to "rename a folder", unfortunately you have to copy all of the objects under that prefix to new keys.
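To make that concrete, here is a rough sketch of the copy-then-delete approach for everything under a prefix, using the AWS SDK for Java v2 that the question mentions. The prefixes are placeholders, the sourceBucket/sourceKey builder methods assume a reasonably recent SDK v2 release, objects over 5 GB would need a multipart copy, and DeleteObjects accepts at most 1000 keys per call.

    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.*;

    import java.util.ArrayList;
    import java.util.List;

    public class S3FolderRename {

        // "Renames" s3://bucket/oldPrefix/... to s3://bucket/newPrefix/... by copy + delete.
        static void renameFolder(S3Client s3, String bucket, String oldPrefix, String newPrefix) {
            List<ObjectIdentifier> toDelete = new ArrayList<>();

            s3.listObjectsV2Paginator(ListObjectsV2Request.builder()
                    .bucket(bucket).prefix(oldPrefix).build())
              .contents()
              .forEach(obj -> {
                  String newKey = newPrefix + obj.key().substring(oldPrefix.length());
                  s3.copyObject(CopyObjectRequest.builder()
                          .sourceBucket(bucket).sourceKey(obj.key())
                          .destinationBucket(bucket).destinationKey(newKey)
                          .build());
                  toDelete.add(ObjectIdentifier.builder().key(obj.key()).build());
              });

            if (!toDelete.isEmpty()) {
                // DeleteObjects removes up to 1000 keys per request; chunk larger folders.
                s3.deleteObjects(DeleteObjectsRequest.builder()
                        .bucket(bucket)
                        .delete(Delete.builder().objects(toDelete).build())
                        .build());
            }
        }
    }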
I have an S3 bucket with a directory structure containing input and output folders. Files are stored in the input directory by a tool that uses the Java API to communicate with S3, but moving the files from the input to the output directory has to be done by other means (even if that is a Java class implemented by us).
What I need to know is whether it is possible to bulk move files, given a list of files to move, without calling the mv command file by file in the Unix S3 CLI, which seems to be very slow. I have looked at some information about bulk delete on S3 with Java, but I would appreciate a more expert opinion if there is one.
I'm open to suggestions for languages that have an API serving my purpose.
PS: The question about the list of files to move is critical, because the criteria used to select those files cannot be expressed with the usual include/exclude filters available in the Unix S3 CLI.
Thanks in advance
EDIT: I just found the boto3 API, which was very simple to set up. Adding this info and tags to the question in order to get more insights. Thanks.
There is no "move" command in Amazon S3. Instead, the objects need to be copied and the source objects then deleted. This is what the AWS CLI actually does when running aws s3 mv.
The great thing about the AWS CLI is that it issues the copy commands in parallel, which greatly reduces the time needed to move a large number of objects. The Amazon S3 CopyObject API call only accepts one object at a time, hence the need to issue such commands in parallel to move objects faster.
An alternative is to use S3 Batch Operations. You can use Put object copy:
The Put object copy operation copies each object specified in the manifest. You can copy objects to a different bucket in the same AWS Region or to a bucket in a different Region. S3 Batch Operations supports most options available through Amazon S3 for copying objects. These options include setting object metadata, setting permissions, and changing an object's storage class.
The list of objects to copy can be specified in a CSV file. You would then need to delete the objects after the copy, which can be done with the AWS CLI delete-objects command and a list of objects.
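Since the question mentions implementing a Java class, here is a rough sketch of the same idea with the AWS SDK for Java v2: copies issued in parallel, followed by a batched DeleteObjects call. The input/output prefix mapping and thread count are illustrative assumptions, and error handling (only deleting keys whose copy actually succeeded) is omitted.

    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.*;

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.stream.Collectors;

    public class BulkMove {

        // Moves the given keys from the input/ prefix to the output/ prefix:
        // copies run in parallel, deletes are batched into a single DeleteObjects call.
        static void move(S3Client s3, String bucket, List<String> keys) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(16);
            for (String key : keys) {
                pool.submit(() -> s3.copyObject(CopyObjectRequest.builder()
                        .sourceBucket(bucket).sourceKey(key)
                        .destinationBucket(bucket)
                        .destinationKey(key.replaceFirst("^input/", "output/"))
                        .build()));
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);

            // In a real implementation, delete only the keys whose copy succeeded.
            // DeleteObjects removes up to 1000 keys per request; chunk larger lists.
            s3.deleteObjects(DeleteObjectsRequest.builder()
                    .bucket(bucket)
                    .delete(Delete.builder().objects(
                            keys.stream()
                                .map(k -> ObjectIdentifier.builder().key(k).build())
                                .collect(Collectors.toList())).build())
                    .build());
        }
    }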
If you need to call the AWS CLI on several files in parallel, you can use GNU parallel on Linux:
find . -name '*.jpg' | parallel aws s3 mv s3://bucketA/{} s3://bucketB/
You'll need to install it though. For example:
sudo apt install parallel
or
sudo yum install parallel
I'm planning to use Amazon S3 to store millions of relatively small files (~100 kB to 2 MB). To save on upload time I structured them into directories (tens/hundreds of files per directory) and decided to use TransferManager's uploadDirectory/uploadFileList. However, after an individual file finishes uploading I need to perform specific operations on my HDD and DB. Is there any way (preferably via observers/listeners) to be notified whenever a specific file has finished uploading, or am I cursed with only being able to verify that the entire MultipleFileUpload succeeded?
For whatever it's worth I'm using the Java SDK, however I should be able to adapt a .NET/REST solution to my needs.
This isn't exactly what you asked, but it's pretty sweet and seems like an appropriate solution...
S3 does have notifications you can configure to alert you when an object has been created or deleted (or if a reduced redundancy object was lost). These can go to SQS, SNS, or Lambda (which could potentially even run the code that updates the database), and of course if you send them to SNS you can then fan them out to multiple destinations.
http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html#notification-how-to-event-types-and-destinations
Don't make the mistake, however, of selecting only the upload subtype you assume is being used; use the s3:ObjectCreated:* event unless you have a specific reason not to.
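For reference, a rough sketch of configuring such a notification from the Java SDK (v1, to match TransferManager). The bucket name and queue ARN are placeholders, the SQS queue also needs a policy allowing S3 to publish to it, and this setup can just as easily be done once in the console.

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.BucketNotificationConfiguration;
    import com.amazonaws.services.s3.model.QueueConfiguration;
    import com.amazonaws.services.s3.model.S3Event;

    import java.util.EnumSet;

    public class ConfigureUploadNotifications {
        public static void main(String[] args) {
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

            // Fire an event for every kind of object creation (PUT, POST, multipart, copy).
            BucketNotificationConfiguration config = new BucketNotificationConfiguration()
                    .addConfiguration("file-uploaded",
                            new QueueConfiguration(
                                    "arn:aws:sqs:us-east-1:123456789012:upload-events", // placeholder ARN
                                    EnumSet.of(S3Event.ObjectCreated)));

            s3.setBucketNotificationConfiguration("my-upload-bucket", config);
        }
    }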
For the last four days I have been trying to find out the path of an uploaded file, and I am starting to think it isn't possible. Can anyone tell me how to get the uploaded file's path in a Java web application? Is there any external API to get the uploaded file's path? My project is a Google App Engine project. Please, can someone answer this?
Since you can't write to the file system on App Engine, you likely can't do whatever it is you are trying to do directly. You need to use one of the available storage options instead, most likely GCS (Google Cloud Storage).
https://developers.google.com/appengine/docs/java/googlecloudstorageclient/
Google Cloud Storage is useful for storing and serving large files. Additionally, Cloud Storage offers the use of access control lists (ACLs), the ability to resume upload operations if they're interrupted, and many other features. (The GCS client library makes use of this resume capability automatically for your app, providing you with a robust way to stream data into GCS.)
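A minimal sketch of writing an uploaded file through the App Engine GCS client library; the bucket name and object naming are assumptions, and instead of a filesystem path you keep the bucket/object name as the file's identifier.

    import com.google.appengine.tools.cloudstorage.GcsFileOptions;
    import com.google.appengine.tools.cloudstorage.GcsFilename;
    import com.google.appengine.tools.cloudstorage.GcsOutputChannel;
    import com.google.appengine.tools.cloudstorage.GcsService;
    import com.google.appengine.tools.cloudstorage.GcsServiceFactory;

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.channels.Channels;

    public class StoreUpload {

        // Writes the uploaded bytes to GCS and returns an identifier for the object,
        // which plays the role of the "file path" you would have had on disk.
        static String store(byte[] uploadedBytes, String fileName) throws IOException {
            GcsService gcs = GcsServiceFactory.createGcsService();
            GcsFilename gcsFile = new GcsFilename("my-app-bucket", "uploads/" + fileName); // placeholder bucket

            GcsOutputChannel channel = gcs.createOrReplace(gcsFile, GcsFileOptions.getDefaultInstance());
            try (OutputStream out = Channels.newOutputStream(channel)) {
                out.write(uploadedBytes);
            }
            return "gs://my-app-bucket/uploads/" + fileName;
        }
    }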
My application requires a layer between Java and the filesystem that hides the fact that the filesystem only contains a subset of all the files (the rest are stored on S3). The layer has to do much of what normal file I/O does, i.e. open files, lock them for reading/writing, etc., but on open it may have to download files, and it must evict closed ones. Another feature I need is that if a file is locked for reading/writing, an open call can unlock the file and close the existing stream (i.e., kick the other user off). Another is management of temporary files.
Is there anything remotely similar that is open source, or do I just have to roll up my sleeves? Should I start from scratch, or are there hooks in Java I/O that I should tap into?
I would suggest you check out Apache Commons VFS. Even if it's not exactly what you need, you may find useful ideas in it.
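To give a feel for the abstraction, here is a tiny sketch of Commons VFS resolving and copying files across backends; the URIs are placeholders, third-party S3 providers exist for VFS, and the caching/locking/eviction layer you describe would still be your own code on top.

    import org.apache.commons.vfs2.FileObject;
    import org.apache.commons.vfs2.FileSystemException;
    import org.apache.commons.vfs2.FileSystemManager;
    import org.apache.commons.vfs2.Selectors;
    import org.apache.commons.vfs2.VFS;

    public class VfsExample {
        public static void main(String[] args) throws FileSystemException {
            FileSystemManager fs = VFS.getManager();

            // The same API resolves local paths, archives, FTP/SFTP URLs, etc.
            FileObject local = fs.resolveFile("file:///tmp/report.csv"); // placeholder path
            FileObject cache = fs.resolveFile("ram://cache/report.csv"); // built-in in-memory FS

            if (local.exists()) {
                cache.copyFrom(local, Selectors.SELECT_SELF); // copy a single file
            }
        }
    }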
This project aims to simplify creating a custom file system implementation for Java 7 NIO (java7-fs-base). The author has implemented a Dropbox FS (java7-fs-dropbox), and started work on S3 (java7-fs-amazon-s3). https://github.com/fge/java7-filesystems
Perhaps AWS Storage Gateway is worth considering in your case:
Gateway-Cached Volumes: You can store your primary data in Amazon S3, and retain your frequently accessed data locally. Gateway-Cached volumes provide substantial cost savings on primary storage, minimize the need to scale your storage on-premises, and retain low-latency access to your frequently accessed data.