How to zip files in an Amazon S3 bucket and get its URL - java

I have a bunch of files inside an Amazon S3 bucket. I want to zip those files and download the resulting archive via an S3 URL using Java Spring.

S3 is not a file server, nor does it offer operating system file services, such as data manipulation.
If there are many "HUGE" files, your best bet is to:
start a simple EC2 instance,
download all those files to the EC2 instance, compress them, and reupload the archive back to the S3 bucket with a new object name (a sketch follows below).
Yes, you can use AWS Lambda to do the same thing, but Lambda is bound to a 900-second (15-minute) execution timeout (thus it is recommended to allocate more RAM to boost Lambda execution performance).
Traffic from S3 to EC2 instances and other services in the same region is FREE.
If your main purpose is just to read those files within the same AWS region using EC2 or other services, then you don't need this extra step. Just access the files directly.
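For the EC2 (or Fargate) approach, a minimal sketch with the AWS SDK for Java v1 might look like the following. It streams every object under a prefix into a zip file on local disk and uploads the archive back to the bucket; the bucket, prefix, and key names are placeholders, and error handling and pagination are left out:
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.services.s3.model.S3ObjectSummary;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class S3Zipper {

    public static void zipPrefixToS3(String bucket, String prefix, String zipKey) throws IOException {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Write the archive to a temp file on the instance's disk
        File zipFile = File.createTempFile("bundle", ".zip");
        try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(zipFile))) {
            // listObjectsV2 returns up to 1000 keys; pagination is omitted for brevity
            for (S3ObjectSummary summary : s3.listObjectsV2(bucket, prefix).getObjectSummaries()) {
                S3Object object = s3.getObject(bucket, summary.getKey());
                zos.putNextEntry(new ZipEntry(summary.getKey()));
                try (InputStream in = object.getObjectContent()) {
                    in.transferTo(zos); // stream the object body straight into the zip entry
                }
                zos.closeEntry();
            }
        }

        // Re-upload the archive to the bucket under a new object name
        s3.putObject(bucket, zipKey, zipFile);
        zipFile.delete();
    }
}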
(Update):
As mentioned by @Robert Reiz, you can now also use AWS Fargate to do the job.
Note:
It is recommended to access and share files using the AWS API. If you intend to share files publicly, you must take security seriously and impose download restrictions. AWS traffic out to the internet is never cheap.

Zip them on your end instead of doing it in AWS, ideally in the frontend, directly in the user's browser. You can stream the download of several files in JavaScript, use that stream to create a zip, and save the zip on the user's disk.
The advantages of moving the zipping to the frontend:
You can use it with S3 URLs, a bunch of presigned links, or even mix content from different sources, some from S3 and some from anywhere else.
You don't waste Lambda memory, nor have to spin up an EC2 or Fargate instance, which saves money. Let the user's computer do it for you.
It improves the user experience - there is no need to wait until the zip is created to start downloading it; the download starts while the zip is being created.
StreamSaver is useful for this purpose, but its zipping example (Saving multiple files as a zip) is limited to files under 4GB as it doesn't implement zip64. You can combine StreamSaver with client-zip, which supports zip64, with something like this (I haven't tested this):
import { downloadZip } from 'client-zip';
import streamSaver from 'streamsaver';
const files = [
  {
    'name': 'file1.txt',
    'input': await fetch('test.com/file1')
  },
  {
    'name': 'file2.txt',
    'input': await fetch('test.com/file2')
  },
]
downloadZip(files).body.pipeTo(streamSaver.createWriteStream('final_name.zip'));
If you choose this option, keep in mind that if you have CORS enabled on your bucket, you will need to add the frontend URL where the zipping is done to the AllowedOrigins field of your bucket's CORS configuration.
About performance:
As @aviv-day points out in a comment, this may not be suitable for all scenarios. The client-zip library has a benchmark that can give you an idea of whether it fits your scenario. Generally, if you have a big set of small files (I don't have a number for what counts as big here, but let's say somewhere between 100 and 1000), zipping alone will take a lot of time and will drain the end user's CPU. Also, if you are offering the same set of zipped files to all users, it's better to zip it once and serve it already zipped. Zipping in the frontend works well with a small group of files that can change dynamically depending on user preferences about what to download. I haven't really tested this, and I think the bottleneck would be the network speed more than the zip process, since it happens on the fly, so I don't think a scenario with a big set of files would actually be a problem. If anyone has benchmarks about this, it would be nice to share them with us!

Hi, I recently had to do this for my application -- serve a bundle of files in zip format through a URL link that users can download.
In a nutshell, first create an in-memory buffer with BytesIO, then use ZipFile to write into that buffer while iterating over all the S3 objects, then upload this zip object with put_object and create a presigned URL for it.
The code I used looks like this:
First, call this function to get the zip buffer; ObjectKeys are the keys of the S3 objects that you need to put into the zip file.
import os
import zipfile
from io import BytesIO

def zipResults(bucketName, ObjectKeys):
    buffer = BytesIO()
    with zipfile.ZipFile(buffer, 'w', compression=zipfile.ZIP_DEFLATED) as zip_file:
        for ObjectKey in ObjectKeys:
            # S3Helper is the author's own wrapper around boto3's get_object
            objectContent = S3Helper().readFromS3(bucketName, ObjectKey)
            fileName = os.path.basename(ObjectKey)
            zip_file.writestr(fileName, objectContent)
    buffer.seek(0)
    return buffer
Then call this function, key is the key you give to your zip object:
import logging
from botocore.exceptions import ClientError

def uploadObject(bucketName, body, key):
    # AwsHelper is the author's own wrapper that returns a boto3 S3 client
    s3client = AwsHelper().getClient("s3")
    try:
        response = s3client.put_object(
            Bucket=bucketName,
            Body=body,
            Key=key
        )
    except ClientError as e:
        logging.error(e)
        return None
    return response
Of course, you would need io, zipfile and boto3 modules.
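The answer above is in Python; since the question asks about Java/Spring, the presigned-URL step it mentions might look roughly like this with the AWS SDK for Java v1 (a hedged sketch; bucket and key names are placeholders):
import com.amazonaws.HttpMethod;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest;

import java.net.URL;
import java.util.Date;

public class PresignedZipUrl {

    public static URL presignedUrlFor(String bucket, String zipKey) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Make the link valid for one hour
        Date expiration = new Date(System.currentTimeMillis() + 3_600_000L);

        GeneratePresignedUrlRequest request =
                new GeneratePresignedUrlRequest(bucket, zipKey)
                        .withMethod(HttpMethod.GET)
                        .withExpiration(expiration);

        return s3.generatePresignedUrl(request);
    }
}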

If you need individual files (objects) in S3 compressed, then it is possible to do so in a roundabout way. You can define a CloudFront endpoint pointing to the S3 bucket, then let CloudFront compress the content on the way out: https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/ServingCompressedFiles.html

Related

Storing Instances with Serialization

I am making a program that needs to save objects for retrieval at a future date. The program will be given away as a jar file to different people.
I can already store and retrieve instances of classes when giving the Object input/output stream an absolute path (String) as a parameter.
I can also save images and text files in the resources folder and get it as a resource with getClass().getResource(String path).
Here is the problem:
I have tried every way possible to save/get objects to/from the resources folder. It gets really weird dealing with URLs and Files instead of ordinary Strings. Can someone please help me? I need to be able to save and retrieve objects relative to the classpath so that I can access the objects when the program is a jar file saved in different paths on the computer.
1: The resources folder (inside the jar) is read-only.
You can create data and store it in the jar when you package it, but after that it is finished: read-only.
2: So you want the user to be able to read and write (and the data is not embedded in your app).
If it is per-user data, you can use (on Windows):
String appdata= System.getenv("APPDATA");
System.out.println(appdata);
String dataFolder = System.getProperty("user.home") + "\\Local Settings\\ApplicationData";
System.out.println(dataFolder);
String dataFolder2 = System.getenv("LOCALAPPDATA");
System.out.println(dataFolder2);
on my PC, it gives:
C:\Users\develop2\AppData\Roaming
C:\Users\develop2\Local Settings\ApplicationData
C:\Users\develop2\AppData\Local
see this: What is the cross-platform way of obtaining the path to the local application data directory?
If it is for everybody, same principles apply, but you can encounter security issues,
like this:
String programdata = System.getenv("PROGRAMDATA");
System.out.println(programdata);
String allusersprofile = System.getenv("ALLUSERSPROFILE");
System.out.println(allusersprofile); // same thing !
String publicdir = System.getenv("PUBLIC");
System.out.println(publicdir);
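To tie it together, here is a minimal sketch (not from the original answer) of saving and loading a serializable object under the per-user application data directory; the folder and file names are just examples:
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class AppDataStore {

    // Resolve a writable per-user folder; falls back to user.home where APPDATA is not set
    private static Path dataDir() throws IOException {
        String appData = System.getenv("APPDATA");
        Path base = (appData != null) ? Paths.get(appData) : Paths.get(System.getProperty("user.home"));
        Path dir = base.resolve("MyApp"); // example application folder name
        Files.createDirectories(dir);
        return dir;
    }

    public static void save(Serializable obj, String fileName) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                Files.newOutputStream(dataDir().resolve(fileName)))) {
            out.writeObject(obj);
        }
    }

    public static Object load(String fileName) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                Files.newInputStream(dataDir().resolve(fileName)))) {
            return in.readObject();
        }
    }
}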

Return file info in addition to file in Spring-based Web Application

I'm currently working on a small Spring-based web application with an AngularJS frontend. One can upload and download files to it, which are mirrored to some other storage. When I download a file, the application checks whether all replicas are there and valid, and if not, it re-uploads the file to the storage that holds the corrupted copy.
What I want to achieve is that when I download the file, I can additionally transfer data about the replicas. More specifically, I want to tell the user whether the file was corrupted somewhere and had to be re-uploaded, and if so, where this happened.
The code I'm currently using is (I know that it's not very efficient to download from all providers every time):
public ResponseEntity<byte[]> downloadFile(@RequestParam("fileName") String filename) {
    // 1) Download the file from each storage
    // 2) Check if all replicas are OK
    // 3) If not, find the corrupted ones and re-upload the file
    // 4) Get one of the OK copies and store it in a byte array named "file"
    // 5) Create some headers and store them in a variable named "headers"
    return new ResponseEntity<>(file, headers, HttpStatus.OK);
}
What I want to know is:
Is it possible to return something that holds some additional information about the corrupted replicas and which is still handled by the browser like a normal file? So instead of returning a byte array, I would return some other magical object that holds the content of the file, with some additional data?
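One common way to do this (not from the original question, just a hedged sketch) is to keep the body as the plain file bytes and put the replica information into custom response headers, which the browser ignores for the download itself but which the AngularJS frontend can read; the header names below are made up:
import org.springframework.http.HttpHeaders;
import org.springframework.http.HttpStatus;
import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;

public class DownloadResponses {

    public static ResponseEntity<byte[]> withReplicaInfo(byte[] file, String filename,
                                                         boolean wasCorrupted, String corruptedProvider) {
        HttpHeaders headers = new HttpHeaders();
        headers.setContentType(MediaType.APPLICATION_OCTET_STREAM);
        headers.set(HttpHeaders.CONTENT_DISPOSITION, "attachment; filename=\"" + filename + "\"");

        // Hypothetical custom headers carrying the replica status
        headers.set("X-Replica-Corrupted", Boolean.toString(wasCorrupted));
        if (wasCorrupted) {
            headers.set("X-Replica-Corrupted-Provider", corruptedProvider);
        }

        return new ResponseEntity<>(file, headers, HttpStatus.OK);
    }
}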

Is it possible to read a shapefile using geotools WITHOUT specifying the url of the file?

I am creating a web application which will allow the upload of shape files for use later on in the program. I want to be able to read an uploaded shapefile into memory and extract some information from it without doing any explicit writing to the disk. The framework I am using (play-framework) automatically writes a temporary file to the disk when a file is uploaded, but it nicely handles the creation and deletion of said file for me. This file does not have any extension, however, so the traditional means of reading a shapefile via Geotools, like this
public void readInShpAndDoStuff(File the_upload) throws IOException {
    Map<String, Serializable> map = new HashMap<>();
    map.put("url", the_upload.toURI().toURL());
    DataStore dataStore = DataStoreFinder.getDataStore(map);
}
fails with an exception which states
NAME_OF_TMP_FILE_HERE is not one of the files types that is known to be associated with a shapefile
After looking at the source of Geotools I see that the file type is checked by looking at the file extension, and since this is a tmp file it has none. (Running file FILENAME shows that the OS recognizes this file as a shapefile).
So at long last my question is, is there a way to read in the shapefile without specifying the URL? Some function or constructor which takes a File object as the argument and doesn't rely on a path? Or is it too much trouble and should I just save a copy on the disk? The latter option is not preferable, since this will likely be operating on a VM server at some point and I don't want to deal with file-system-specific stuff.
Thanks in advance for any help!
I can't see how this is going to work for you; a shapefile (despite its name) is a group of 3 (or more) files which share a basename and have extensions of .shp, .dbf, .shx (and usually .prj, .sbn, .fix, .qix, etc.).
Is there some way to make Play write the extensions with the temp file name?
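If you can get hold of all the uploaded parts (.shp, .shx, .dbf, ...), one workaround (a hedged sketch, not from the original answer; GeoTools import locations vary by version) is to copy them into a temp directory under a shared basename with the proper extensions and point the data store at the copied .shp:
import java.io.File;
import java.io.IOException;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.HashMap;
import java.util.Map;

import org.geotools.data.DataStore;
import org.geotools.data.DataStoreFinder;

public class UploadedShapefileReader {

    // "parts" maps an extension ("shp", "shx", "dbf", ...) to the extensionless
    // temp file the framework created for that uploaded part
    public static DataStore openUploadedShapefile(Map<String, File> parts) throws IOException {
        Path dir = Files.createTempDirectory("shapefile-upload");
        Path shp = null;
        for (Map.Entry<String, File> part : parts.entrySet()) {
            Path target = dir.resolve("upload." + part.getKey());
            Files.copy(part.getValue().toPath(), target, StandardCopyOption.REPLACE_EXISTING);
            if ("shp".equals(part.getKey())) {
                shp = target;
            }
        }

        Map<String, Serializable> params = new HashMap<>();
        params.put("url", shp.toUri().toURL());
        return DataStoreFinder.getDataStore(params);
    }
}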

Random-access Zip file without writing it to disk

I have a 1-2GB zip file with 500-1000k entries. I need to get files by name in a fraction of a second, without fully unpacking it. If the file is stored on an HDD, this works fine:
import java.io.File;
import java.io.IOException;
import java.util.Enumeration;
import java.util.HashMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ZipMapper {
    private HashMap<String, ZipEntry> map;
    private ZipFile zf;

    public ZipMapper(File file) throws IOException {
        map = new HashMap<>();
        zf = new ZipFile(file);
        Enumeration<? extends ZipEntry> en = zf.entries();
        while (en.hasMoreElements()) {
            ZipEntry ze = en.nextElement();
            map.put(ze.getName(), ze);
        }
    }

    public Node getNode(String key) throws IOException {
        return Node.loadFromStream(zf.getInputStream(map.get(key)));
    }
}
But what can I do if the program downloaded the zip file from Amazon S3 and only has its InputStream (or byte array)? While downloading 1GB takes ~1 second, writing it to the HDD may take some time, and it is slightly harder to handle multiple files since there is no garbage collector for the HDD.
ZipInputStream does not allow random access to entries.
It would be nice to create a virtual File in memory from a byte array, but I couldn't find a way to do that.
You could mark the file to be deleted on exit.
If you want to go for an in-memory approach: have a look at the new NIO.2 File API. Oracle provides a filesystem provider for zip/jar, and AFAIK ShrinkWrap provides an in-memory filesystem. You could try a combination of the two.
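For reference, a minimal sketch of the zip filesystem provider reading a single entry by name; the zip still has to live somewhere a FileSystem can reach (on disk here, or in an in-memory filesystem as suggested above):
import java.io.IOException;
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;

public class ZipFsLookup {

    public static byte[] readEntry(Path zipOnDisk, String entryName) throws IOException {
        // "jar:" + a file URI selects the built-in zip/jar filesystem provider
        URI uri = URI.create("jar:" + zipOnDisk.toUri());
        try (FileSystem zipFs = FileSystems.newFileSystem(uri, Collections.emptyMap())) {
            Path entry = zipFs.getPath(entryName);
            return Files.readAllBytes(entry); // random access by name, no full unpacking
        }
    }
}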
I've written some utility methods to copy directories and files to/from a Zip file using the NIO.2 File API (the library is Open Source):
Maven:
<dependency>
    <groupId>org.softsmithy.lib</groupId>
    <artifactId>softsmithy-lib-core</artifactId>
    <version>0.3</version>
</dependency>
Tutorial:
http://softsmithy.sourceforge.net/lib/current/docs/tutorial/nio-file/index.html
API: CopyFileVisitor.copy
Especially PathUtils.resolve helps with resolving paths across filesystems.
You can use the SecureBlackbox library; it allows ZIP operations on any seekable stream.
I think you should consider using your OS to create an "in memory" file system (i.e. a RAM drive).
In addition, take a look at the FileSystems API.
A completely different approach: if the server has the file on disk (and possibly cached in RAM already), make it give you the file(s) directly. In other words, submit which files you need and then extract and deliver them on the server.
The Blackbox library only has an Extract(String name, String outputPath) method. It seems that it can indeed randomly access any file in a seekable zip stream, but it can't write the result to a byte array or return a stream.
I couldn't find any documentation for ShrinkWrap, and I couldn't find any suitable implementations of FileSystem/FileSystemProvider etc.
However, it turned out that the Amazon EC2 instance I'm running (Large) somehow writes a 1GB file to disk in ~1 second, so I just write the file to disk and use ZipFile.
If the HDD were slow, I think a RAM disk would be the easiest solution.

How to clone objects in Amazon S3 using RestS3Service

So I'm trying to clone objects in a folder in my S3 (Amazon S3) account. But I was wondering if there is a way to do it without having to write the file to my local system first and then upload that file back up to S3?
Eventually I want it to be fully recursive, cloning folders and objects in a given bucket, but for now I'm stuck on getting it to clone efficiently.
Say the bucket path is images.example.com/products/prodSku
and in that prodSku folder I have a bunch of images I want to copy to a new folder.
Here's what I have so far
(note: this is written in Groovy, but if you know Java, it's the same thing):
try {
    def s3os = restService.listObjects(bucket_name, sourcePrefix, null)
    def s3o
    for (def i in s3os) {
        s3o = get(bucket_name, i.key)
        // i want to be able to do something like this, just putting the input stream
        // back into s3. but i can't. from what i know now, i have to write the
        // dataInputStream into a file locally, then use that file to create a new S3Object
        // which is placed as the second argument in the putObject method
        restService.putObject(destinationBucketName, s3o.dataInputStream)
    }
} catch (S3ServiceException e) {
    println e
}
Sorry the formatting is all messed up, first time posting a message.
but any help would be greatly appreciated!
Thanks!
Not sure about the JetS3t API, but the AWS SDK for Java does provide a simple copyObject method.
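With the AWS SDK for Java v1, a server-side copy (no download and re-upload) might look roughly like this; the bucket names and prefixes are placeholders, and pagination is omitted:
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class S3CloneExample {

    public static void copyPrefix(String bucket, String sourcePrefix, String destPrefix) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Copy every object under the source prefix to the destination prefix,
        // entirely server-side: nothing is downloaded to the local machine
        for (S3ObjectSummary summary : s3.listObjectsV2(bucket, sourcePrefix).getObjectSummaries()) {
            String destKey = destPrefix + summary.getKey().substring(sourcePrefix.length());
            s3.copyObject(bucket, summary.getKey(), bucket, destKey);
        }
    }
}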
So I ended up figuring out how to clone the asset in S3 using JetS3t. It was simpler than I expected. I'll post it here in case anyone ever googles this question.
All you do is first get the S3 object you want to clone. After you have it, call setKey(filename) on the S3 object; "filename" is the path where you want the object to be, followed by the file name itself, i.e. yours3bucketname/products/assets/picture.png.
After you're done with that, just call putObject(bucket_name, s3object), passing the S3 object that you called setKey on as the second argument.
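In code, that might look roughly like this (a hedged sketch against the JetS3t API as described above; method signatures vary between JetS3t versions, and the names are placeholders):
import org.jets3t.service.impl.rest.httpclient.RestS3Service;
import org.jets3t.service.model.S3Object;

public class JetS3tClone {

    // restService is an already-authenticated RestS3Service, as in the question
    public static void cloneObject(RestS3Service restService, String bucketName,
                                   String sourceKey, String destKey) throws Exception {
        S3Object s3o = restService.getObject(bucketName, sourceKey);
        s3o.setKey(destKey); // e.g. "products/assets/picture.png"
        restService.putObject(bucketName, s3o);
    }
}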
good luck! happy programming!
