So I'm trying to clone objects in a folder in my Amazon S3 account, but I was wondering: is there a way to do it without having to write each file to my local system first and then upload it back up to S3?
Eventually I want it to recursively clone folders and objects in a given bucket, but for now I'm stuck on getting it to clone efficiently.
Say the bucket path is images.example.com/products/prodSku,
and in that prodSku folder I have a bunch of images I want to copy to a new folder.
Here's what I have so far.
(Note: this is written in Groovy, but if you know Java, it's the same thing.)
try {
    def s3os = restService.listObjects(bucket_name, sourcePrefix, null)
    def s3o
    for (def i in s3os) {
        s3o = restService.getObject(bucket_name, i.key)
        // i want to be able to do something like this, just putting the input stream
        // back into s3. but i can't. from what i know now, i have to write the
        // dataInputStream into a file locally, then use that file to create a new S3Object
        // which is placed as the second argument in the putObject method
        restService.putObject(destinationBucketName, s3o.dataInputStream)
    }
} catch (S3ServiceException e) {
    println e
}
Sorry the formatting is all messed up, first time posting a message.
But any help would be greatly appreciated!
Thanks!
Not sure about the JetS3t API, but the AWS SDK for Java does provide a simple copyObject method.
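For reference, a server-side copy with the AWS SDK for Java looks roughly like this (bucket and key names are placeholders); the object is copied entirely within S3, nothing is downloaded locally:
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
s3.copyObject("images.example.com", "products/prodSku/picture.png",
              "images.example.com", "products/newFolder/picture.png");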
So I ended up figuring out how to clone the asset in S3 using JetS3t. It was simpler than I expected. I'll post it here in case anyone ever googles this question.
All you do is first get the S3 object you want to clone. After you have it, call setKey(filename) on the S3 object. "filename" is the path for where you want the object to go, followed by the file name itself, i.e. yours3bucketname/products/assets/picture.png.
After you're done with that, just call putObject(bucket_name, s3object), passing the S3 object that you called setKey on as the second argument.
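In code, the whole clone is something like this (a minimal sketch in Java against the JetS3t S3Service API; the key strings are just examples):
// get the existing object (use the S3Bucket overload if your JetS3t version requires it)
S3Object s3o = restService.getObject(bucket_name, "products/prodSku/picture.png");
// point the same object at its new location, then put it back
s3o.setKey("products/newFolder/picture.png");
restService.putObject(destinationBucketName, s3o);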
Good luck! Happy programming!
I have a bunch of files inside an Amazon S3 bucket, and I want to zip those files and download the contents via an S3 URL using Java Spring.
S3 is not a file server, nor does it offer operating system file services, such as data manipulation.
If there are many huge files, your best bet is to:
Start a simple EC2 instance.
Download all those files to the EC2 instance, compress them, and re-upload the archive back to the S3 bucket under a new object name (see the sketch below).
Yes, you can use AWS Lambda to do the same thing, but Lambda is bound by a 900-second (15-minute) execution timeout (so it is recommended to allocate more RAM to boost Lambda execution performance).
Traffic from S3 to an EC2 instance (and other services) in the same region is free.
If your main purpose is just to read those files within the same AWS region using EC2 or other services, then you don't need this extra step. Just access the files directly.
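For the download/compress/re-upload step above, a rough sketch with the AWS SDK for Java v1 running on the EC2 instance might look like this (bucket, prefix and key names are placeholders; pagination and error handling are omitted):
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import java.io.File;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
File zipFile = new File("/tmp/bundle.zip"); // local to the EC2 instance
try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(zipFile))) {
    // listObjects returns at most 1000 keys per call; loop with listNextBatchOfObjects for more
    for (S3ObjectSummary summary : s3.listObjects("my-bucket", "some/prefix/").getObjectSummaries()) {
        zos.putNextEntry(new ZipEntry(summary.getKey()));
        try (InputStream in = s3.getObject("my-bucket", summary.getKey()).getObjectContent()) {
            in.transferTo(zos); // Java 9+; stream the object straight into the zip entry
        }
        zos.closeEntry();
    }
}
s3.putObject("my-bucket", "archives/bundle.zip", zipFile); // re-upload under a new object name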
(Update):
As mentioned by @Robert Reiz, you can now also use AWS Fargate to do the job.
Note:
It is recommended to access and share files using the AWS APIs. If you intend to share files publicly, you must take security seriously and impose download restrictions; AWS traffic out to the internet is never cheap.
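If you do need to hand files to someone outside AWS, a time-limited presigned URL is the usual compromise. A minimal sketch with the AWS SDK for Java v1 (bucket and key names are placeholders):
import com.amazonaws.HttpMethod;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest;
import java.net.URL;
import java.util.Date;

AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
Date expiration = new Date(System.currentTimeMillis() + 15 * 60 * 1000); // link is valid for 15 minutes
GeneratePresignedUrlRequest request =
        new GeneratePresignedUrlRequest("my-bucket", "archives/bundle.zip") // placeholder bucket/key
                .withMethod(HttpMethod.GET)
                .withExpiration(expiration);
URL url = s3.generatePresignedUrl(request); // share this URL; it stops working after expiration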
Zip them on your end instead of doing it in AWS, ideally in the frontend, directly in the user's browser. You can stream the download of several files in JavaScript, use that stream to create a zip, and save that zip to the user's disk.
The advantages of moving the zipping to the frontend:
You can use it with S3 URLs, a bunch of presigned links, or even mix content from different sources, some from S3 and some from anywhere else.
You don't waste Lambda memory or have to spin up an EC2/Fargate instance, which saves money. Let the user's computer do it for you.
It improves the user experience - no need to wait for the zip to be created before downloading it; the download starts while the zip is being created.
StreamSaver is useful for this purpose, but its zipping example (Saving multiple files as a zip) is limited to files under 4 GB because it doesn't implement zip64. You can combine StreamSaver with client-zip, which supports zip64, with something like this (I haven't tested this):
import { downloadZip } from 'client-zip';
import streamSaver from 'streamsaver';

// inside an async function: fetch each file and pair it with the name it should have in the zip
const files = [
  {
    name: 'file1.txt',
    input: await fetch('test.com/file1')
  },
  {
    name: 'file2.txt',
    input: await fetch('test.com/file2')
  },
];

// stream the zip straight to the user's disk as it is being built
downloadZip(files).body.pipeTo(streamSaver.createWriteStream('final_name.zip'));
If you choose this option, keep in mind that if you have CORS enabled on your bucket, you will need to add the frontend URL where the zipping is done to the AllowedOrigins field of your bucket's CORS configuration.
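For example, a minimal CORS rule on the bucket could look like this (the origin is a placeholder for wherever your frontend is served from):
[
    {
        "AllowedOrigins": ["https://app.example.com"],
        "AllowedMethods": ["GET"],
        "AllowedHeaders": ["*"]
    }
]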
About performance:
As @aviv-day points out in a comment, this may not be suitable for all scenarios. The client-zip library has a benchmark that can give you an idea of whether it fits yours. Generally, if you have a big set of small files (I don't have a hard number for what counts as big here, but say somewhere between 100 and 1000), zipping alone will take a lot of time and will drain the end user's CPU. Also, if you are offering the same set of files zipped to all users, it's better to zip it once and serve it already zipped. Zipping in the frontend works well with a small group of files that can change dynamically depending on what the user chooses to download. I haven't really tested this, and I think the bottleneck would be the network speed more than the zip process, since it happens on the fly, so I don't think a scenario with a big set of files would actually be a problem. If anyone has benchmarks on this, it would be nice to share them with us!
Hi, I recently had to do this for my application: serve a bundle of files in zip format through a URL link that users can download.
In a nutshell: first create a buffer using BytesIO, then use ZipFile to write into this buffer while iterating over the S3 objects, then upload the zip with put_object and create a presigned URL for it.
The code I used looks like this:
First, call this function to get the zip object; ObjectKeys are the keys of the S3 objects that you need to put into the zip file.
import os
import zipfile
from io import BytesIO

def zipResults(bucketName, ObjectKeys):
    buffer = BytesIO()
    with zipfile.ZipFile(buffer, 'w', compression=zipfile.ZIP_DEFLATED) as zip_file:
        for ObjectKey in ObjectKeys:
            # S3Helper is my own wrapper around boto3; readFromS3 returns the object's content
            objectContent = S3Helper().readFromS3(bucketName, ObjectKey)
            fileName = os.path.basename(ObjectKey)
            zip_file.writestr(fileName, objectContent)
    buffer.seek(0)
    return buffer
Then call this function; key is the key you give to your zip object:
import logging
from botocore.exceptions import ClientError

def uploadObject(bucketName, body, key):
    # AwsHelper is my own wrapper; it just returns a boto3 S3 client
    s3client = AwsHelper().getClient("s3")
    try:
        response = s3client.put_object(
            Bucket=bucketName,
            Body=body,
            Key=key
        )
    except ClientError as e:
        logging.error(e)
        return None
    return response
Of course, you would need the io, zipfile, logging and boto3 modules (ClientError comes from botocore.exceptions; S3Helper and AwsHelper are my own thin wrappers around boto3).
If you need individual files (objects) in S3 compressed, then it is possible to do so in a round-about way. You can define a CloudFront endpoint pointing to the S3 bucket, then let CloudFront compress the content on the way out: https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/ServingCompressedFiles.html
I am making a program that needs to save objects for retrieval at a future date. The program will be given away as a jar file to different people.
I can already store and retrieve instances of classes when giving the object input/output stream an absolute path (String) as a parameter.
I can also save images and text files in the resources folder and get them as a resource with getClass().getResource(String path).
Here is the problem:
I have tried every way possible to save/get objects to/from the resources folder. It gets really weird dealing with URLs and Files rather than ordinary Strings. Can someone please help me? I need to be able to save and retrieve objects relative to the classpath so that I can access the objects when the program is a jar file saved at different paths on the computer.
1: The resources folder (inside the jar) is read-only.
You can create data and store it in the jar when you package the application, but after that it is fixed: read-only.
2: So you want the user to be able to read and write (and the data is not embedded in your app).
If it is per-user data, you can use (on Windows):
String appdata= System.getenv("APPDATA");
System.out.println(appdata);
String dataFolder = System.getProperty("user.home") + "\\Local Settings\\ApplicationData";
System.out.println(dataFolder);
String dataFolder2 = System.getenv("LOCALAPPDATA");
System.out.println(dataFolder2);
on my PC, it gives:
C:\Users\develop2\AppData\Roaming
C:\Users\develop2\Local Settings\ApplicationData
C:\Users\develop2\AppData\Local
see this: What is the cross-platform way of obtaining the path to the local application data directory?
If it is for everybody, the same principles apply, but you can encounter security issues,
like this:
String programdata = System.getenv("PROGRAMDATA");
System.out.println(programdata);
String allusersprofile = System.getenv("ALLUSERSPROFILE");
System.out.println(allusersprofile); // same thing !
String publicdir = System.getenv("PUBLIC");
System.out.println(publicdir);
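Putting it together, here is a minimal sketch of saving and loading a serializable object under the user's application-data folder; the folder name, file name and MyClass type are made up for illustration, and APPDATA is Windows-only as noted above:
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

Path dir = Paths.get(System.getenv("APPDATA"), "MyApp"); // e.g. C:\Users\you\AppData\Roaming\MyApp
Files.createDirectories(dir); // make sure the folder exists before writing
Path file = dir.resolve("state.ser");

// save
try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
    out.writeObject(myObject); // myObject must implement Serializable
}

// load
try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
    MyClass restored = (MyClass) in.readObject();
}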
I am trying to save thousands of models produced by ML Pipeline. As indicated in the answer here, the models can be saved as follows:
import java.io._

def saveModel(name: String, model: PipelineModel) = {
  val oos = new ObjectOutputStream(new FileOutputStream(s"/some/path/$name"))
  oos.writeObject(model)
  oos.close()
}

schools.zip(bySchoolArrayModels).foreach {
  case (name, model) => saveModel(name, model)
}
I have tried using s3://some/path/$name and /user/hadoop/some/path/$name, as I would like the models to be saved to Amazon S3 eventually, but they both fail with messages indicating the path cannot be found.
How to save models to Amazon S3?
One way to save a model to HDFS is as following:
// persist model to HDFS
sc.parallelize(Seq(model), 1).saveAsObjectFile("hdfs:///user/root/linReg.model")
The saved model can then be loaded as:
val linRegModel = sc.objectFile[LinearRegressionModel]("linReg.model").first()
For more details see (ref)
Since Apache Spark 1.6, in the Scala API, you can save your models without using any tricks, because all models from the ML library come with a save method; you can check this in LogisticRegressionModel, and indeed it has that method. To load the model back, you can use the static load method:
val logRegModel = LogisticRegressionModel.load("myModel.model")
FileOutputStream saves to the local filesystem (not through the Hadoop libraries), so saving to a local directory is the way to go about doing this. That being said, the directory needs to exist, so make sure you create it first.
Depending on your model, you may also wish to look at https://spark.apache.org/docs/latest/mllib-pmml-model-export.html (PMML export).
I have a URL http://......../somefolder/ and I want to get the names of all the files inside this folder. I have tried the code below but it's showing an error.
URL url = new URL("http://.............../pages/");
File f = new File(url.getFile());
String[] list = f.list();
for (String x : list) {
    System.out.println(x);
}
Error: Exception in thread "main" java.lang.NullPointerException
    at Directory.main(Directory.java:25)
It's not possible to do it like this.
HTTP has no concept of a "folder". The thing you see when you open that URL is just another web page, which happens to have a bunch of links to other pages. It's not special in any way as far as HTTP is concerned (and therefore HTTP clients, like the one built into Java).
That's not to say it's completely impossible. You might be able to get the file list another way.
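For example, if the server exposes a plain HTML directory listing (an Apache-style autoindex), you can fetch that page and pull the link targets out of it. Here is a rough sketch using the jsoup library; the URL is a placeholder, and this only works when such a listing page actually exists:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document doc = Jsoup.connect("http://example.com/pages/").get(); // throws IOException on failure
for (Element link : doc.select("a[href]")) {
    System.out.println(link.attr("href")); // prints whatever the listing page links to
}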
Edit: The reason your code doesn't work is that it does something completely nonsensical. url.getFile() will return something like "/......./pages/", and then you pass that into the File constructor - which gives you a File representing the path /....../pages/ (or C:\......\pages\ on Windows). f.list() sees that that path doesn't exist on your computer, and returns null. There is no way to get a File that points to a URL, just like there's no way to get an int with the value 5.11.
I am working on a program that integrates Hadoop's MapReduce framework with Xuggle. For that, I am implementing an IURLProtocolHandlerFactory class that reads and writes from and to in-memory Hadoop data objects.
You can see the relevant code here:
https://gist.github.com/4191668
The idea is to register each BytesWritable object in the IURLProtocolHandlerFactory class with a UUID so that when I later refer to that name while opening the file, it returns an IURLProtocolHandler instance that is attached to that BytesWritable object, and I can read and write from and to memory.
The problem is that I get an exception like this:
java.lang.RuntimeException: could not open: byteswritable:d68ce8fa-c56d-4ff5-bade-a4cfb3f666fe
at com.xuggle.mediatool.MediaReader.open(MediaReader.java:637)
(see also under the posted link)
When debugging I see that the objects are correctly found in the factory; what's more, they are even being read from in the protocol handler. If I remove the listeners from/to the output file, the same thing happens, so the problem is already with the input. Digging deeper into the Xuggle code I reach the JNI code (which tries to open the file) and I can't get further than this. It apparently returns an error code.
XugglerJNI.IContainer_open__SWIG_0
I would really appreciate some hint where to go next, how should I continue debugging. Maybe my implementation has a flaw, but I can't see it.
I think the problem you are running into is that a lot of the types of inputs/outputs are converted to a native file descriptor in the IContainer JNI code, but the thing you are passing cannot be converted. It may not be possible to create your own IURLProtocolHandler in this way, because after a trip through XuggleIO.map() it would just end up calling IContainer again, and then go into the IContainer JNI code, which will probably try to get a native file descriptor and call avio_open().
However, there may be a couple of things that you can open in IContainer which are not files and have no file descriptors, and which would be handled correctly. The things you can open can be seen in the IContainer code, namely java.io.DataOutput and java.io.DataOutputStream (and the corresponding inputs). I recommend making your own DataInput/DataOutput implementation that wraps around BytesWritable, and opening it in IContainer.
If that doesn't work, then write your inputs to a temp file and read the outputs from a temp file :)
You can copy the file to local disk first and then try to open the container:
filePath = split.getPath();
final FileSystem fileSystem = filePath.getFileSystem(job);
// copy the HDFS file into the local working directory so Xuggler can open it by name
Path localFile = new Path(filePath.getName());
fileSystem.createNewFile(localFile);
fileSystem.copyToLocalFile(filePath, localFile);
int result = container.open(filePath.getName(), IContainer.Type.READ, null);
This code works for me in the RecordReader class.
In your case, you may copy the file to local disk first and then try to create the MediaReader.