Random-access Zip file without writing it to disk - java

I have a 1-2GB zip file with 500-1000k entries. I need to get files by name in fraction of second, without full unpacking. If file is stored on HDD, this works fine:
public class ZipMapper {
private HashMap<String,ZipEntry> map;
private ZipFile zf;
public ZipMapper(File file) throws IOException {
map = new HashMap<>();
zf = new ZipFile(file);
Enumeration<? extends ZipEntry> en = zf.entries();
while(en.hasMoreElements()) {
ZipEntry ze = en.nextElement();
map.put(ze.getName(), ze);
}
}
public Node getNode(String key) throws IOException {
return Node.loadFromStream(zf.getInputStream(map.get(key)));
}
}
But what can I do if program downloaded the zip file from Amazon S3 and has its InputStream (or byte array)? While downloading 1GB takes ~1 second, writing it to HDD may take some time, and it is slightly harder to handle multiple files since we don't have HDD garbage collector.
ZipInputStream does not allow to random access to entries.
It would be nice to create a virtual File in memory by byte array, but I couldn't find a way to.

You could mark the file to be deleted on exit.
If you want to go for an in-memory approach: Have a look at the new NIO.2 File API. Oracle provides a filesystem provider for zip/ jar and AFAIK ShrinkWrap provides an in-memory filesystem. You could try a combination of the two.
I've written some utility methods to copy directories and files to/from a Zip file using the NIO.2 File API (the library is Open Source):
Maven:
<dependency>
<groupId>org.softsmithy.lib</groupId>
<artifactId>softsmithy-lib-core</artifactId>
<version>0.3</version>
</dependency>
Tutorial:
http://softsmithy.sourceforge.net/lib/current/docs/tutorial/nio-file/index.html
API: CopyFileVisitor.copy
Especially PathUtils.resolve helps with resolving paths across filesystems.

You can use SecureBlackbox library, it allows ZIP operations on any seekable streams.

I think you should consider using your OS in order to create "in memory" file system (i.e - RAM drive).
In addition, take a look at the FileSystems API.

A completely different approach: If the server has the file on disk (and possibly cached in RAM already): make it give you the file(s) directly. In other words, submit which files you need and then take care to extract and deliver these on the server.

Blackbox library only has Extract(String name, String outputPath) method. Seems that it can randomly access any file in seekable zip-stream indeed, but it can't write result to byte array or return stream.
I couldn't find and documentation for ShrinkWrap. I couldn't find any suitable implementations of FileSystem/FileSystemProvider etc.
However, it turned out that Amazon EC2 instance I'm running (Large) somehow writes 1gb file to disk in ~1 second. So I just write file to the disk and use ZipFile.
If HDD would be slow, I think RAM disk would be the easiest solution.

Related

How to zip files in Amazon s3 Bucket and get its URL

I have a bunch of files inside Amazon s3 bucket, I want to zip those file and download get the contents via S3 URL using Java Spring.
S3 is not a file server, nor does it offer operating system file services, such as data manipulation.
If there is many "HUGE" files, your best bet is
start a simple EC2 instance
Download all those files to EC2 instance, compress them, reupload it back to S3 bucket with a new object name
Yes, you can use AWS lambda to do the same thing, but lambda is bounds to 900 seconds (15 mins) execution timeout (Thus it is recommended to allocate more RAM to boost lambda execution performance)
Traffics from S3 to local region EC2 instance and etc services is FREE.
If your main purpose is just to read those file within same AWS region using EC2/etc services, then you don't need this extra step. Just access the file directly.
(Update) :
As mentioned by #Robert Reiz, now you can also use AWS Fargate to do the job.
Note :
It is recommended to access and share file using AWS API. If you intend to share the file publicly, you must look into security issue seriously and impose download restriction. AWS traffics out to internet is never cheap.
Zip them in your end instead of doing it in AWS, ideally in frontend, directly on user browser. You can stream the download of several files in javascript, use that stream to create a zip and save this zip on user disk.
The advantages of moving the zipping to the frontend:
You can use it with S3 URLs, a bunch of presigned links or even mixing content from different sources, some from S3, some of whatever other place.
You don't waste lambda memory, nor have to up an EC2 fargate instance, that saves money. Let the user computer do it for you.
Improves user experience - no needs to wait the zip is created to start downloading it, just start downloading meanwhile the zip is being created.
StreamSaver is useful for this purpose, but in their zipping examples (Saving multiple files as a zip) is limited by less than 4GB files as it doesn't implement zip64. You can combine StreamSaver with client-zip, that support zip64, with something like this (I haven't test this):
import { downloadZip } from 'client-zip';
import streamSaver from 'streamsaver';
const files = [
{
'name': 'file1.txt',
'input': await fetch('test.com/file1')
},
{
'name': 'file2.txt',
'input': await fetch('test.com/file2')
},
]
downloadZip(files).body.pipeTo(streamSaver.createWriteStream('final_name.zip'));
In case you choose this option, keep in mind that if you have CORS enabled in your bucket you will need to add the frontend url where the zipping is done, right in the AllowedOrigins field from your CORS configuration of your bucket.
About performance:
As #aviv-day complains in a comment this could not be suitable for all scenarios. Client-zip library has a benchmark that can give you an idea if this fit or not with your scenario. Generally, if you have a big set of small files (I don't have a number about what is big here, but I'll say something between 100 and 1000) it will take a lot of time just zipping it, and it will drain the final user CPU. Also, if you are offering the same set of files zipped for all the users, it's better zip it one and present it already zipped. Using this method of zipping in frontend works well with a limited small group of files that can dynamically change depending on user preferences about what to download. I've no really test this and I really think the bottle neck would be the network speed more than the zip process, as it happens on the fly, I don't really think that scenario with a big set of files would actually be a problem. If anyone have benchmarks about this would be nice to share with us!
Hi I recently have to do that for my application -- serve a bundle of files in zip format through a url link that the users can download.
In a nutshell, first create an object using BytesIO method, then use the ZipFile method to write into this object by iterating all the s3 objects, then use put method on this zip object and create a presiged url for it.
The code I used looks like this:
First, call this function to get the zip object, ObjectKeys are the s3 objects that you need to put into the zip file.
def zipResults(bucketName, ObjectKeys):
buffer = BytesIO()
with zipfile.ZipFile(buffer, 'w', compression=zipfile.ZIP_DEFLATED) as zip_file:
for ObjectKey in ObjectKeys:
objectContent = S3Helper().readFromS3(bucketName, ObjectKey)
fileName = os.path.basename(ObjectKey)
zip_file.writestr(fileName, objectContent)
buffer.seek(0)
return buffer
Then call this function, key is the key you give to your zip object:
def uploadObject(bucketName, body, key):
s3client = AwsHelper().getClient("s3")
try:
response = s3client.put_object(
Bucket=bucketName,
Body=body,
Key=key
)
except ClientError as e:
logging.error(e)
return None
return response
Of course, you would need io, zipfile and boto3 modules.
If you need individual files (objects) in S3 compressed, then it is possible to do so in a round-about way. You can define a CloudFront endpoint pointing to the S3 bucket, then let CloudFront compress the content on the way out: https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/ServingCompressedFiles.html

Is it possible to read a shapefile using geotools WITHOUT specifying the url of the file?

I am creating a web application which will allow the upload of shape files for use later on in the program. I want to be able to read an uploaded shapefile into memory and extract some information from it without doing any explicit writing to the disk. The framework I am using (play-framework) automatically writes a temporary file to the disk when a file is uploaded, but it nicely handles the creation and deletion of said file for me. This file does not have any extension, however, so the traditional means of reading a shapefile via Geotools, like this
public void readInShpAndDoStuff(File the_upload){
Map<String, Serializable> map = new HashMap<>();
map.put( "url", the_upload.toURI().toURL() );
DataStore dataStore = DataStoreFinder.getDataStore( map );
}
fails with an exception which states
NAME_OF_TMP_FILE_HERE is not one of the files types that is known to be associated with a shapefile
After looking at the source of Geotools I see that the file type is checked by looking at the file extension, and since this is a tmp file it has none. (Running file FILENAME shows that the OS recognizes this file as a shapefile).
So at long last my question is, is there a way to read in the shapefile without specifying the Url? Some function or constructor which takes a File object as the argument and doesn't rely on a path? Or is it too much trouble and I should just save a copy on the disk? The latter option is not preferable, since this will likely be operating on a VM server at some point and I don't want to deal with file system specific stuff.
Thanks in advance for any help!
I can't see how this is going to work for you, a shapefile (despite it's name) is a group of 3 (or more) files which share a basename and have extensions of .shp, .dbf, .sbx (and usually .prj, .sbn, .fix, .qix etc).
Is there someway to make play write the extensions with the tempfile name?

handling about 450.000 files in a zip

My question is simple. Would Java handle a .zip file with about 450,000 files in there? The code that I wrote would not load all of the files, just one specific file would be searched in the zip, and be read line by line. The file size is about 500kb.
Would this work or will I get an OutOfMemory Exception?
Oh sry, uncompressed there about 0,5MB. Zipped are they whole files about 250mb.
Ok, the name of the Files are IDs + Date(unique) in that zip file. If i have to check a log, ill call Java and give the ID + Date and Java is reading just that one file, never more.
Edit: It works, it works very well. About 400.000 files in a zip, if u have the Memory to Zip the Files works without any problem.
Edit2: It works on Linux Filesystems witout a problem, on NTFS sometimes it crashed. NTFS has a problem with that musch files in 1 Zip.
Using the zip filesystem in Java 7, you can actually access one individual file pretty easily and open a BufferedReader on it.
First you have to create the FileSystem:
public static FileSystem getZipFileSystem(final String zipPath)
{
final Path path = Paths.get(zipPath).toAbsolutePath();
final Map<String, Object> env = new HashMap<>();
final URI uri = URI.create("jar:file:" + path.toString());
return FileSystems.newFileSystem(uri, env, null);
}
Once you have done that, you can create a BufferedReader from an entry in the zip itself:
try (
final FileSystem fs = getZipFileSystem("/path/to/the.zip");
final BufferedReader reader = Files.newBufferedReader(fs.getPath("path/to/entry"),
StandardCharsets.UTF_8);
) {
// operate on the reader
}
You could also read all lines in the entry at once using Files.readAllLines().
If you wish to copy a zip entry to a file on the filesystem, you can also do that:
Files.copy(zipfs.getPath("path/to/entry"), Paths.get("file/on/local/fs"));
Or you can directly copy the result to an OutputStream, or directly create an entry from an OutputStream...
Or even walk the entire zip using Files.walkFileTree().
Or get all the entries in a "directory" in a zip using Files.newDirectoryStream(). Note that as its name says, this is a stream; unlike File.listFiles() (which only works on files on disk anyway), this returns a iterator over the entries.
Or... Or... Or...
Note that a FileSystem needs to be .close()d.
I'm not sure that I understand what you're trying to do.
If it's 0.5 MB/file and 450,000 files, you'll need 225GB. You won't have enough memory to do all this in a single zip in memory even if you get 90% compression.
I'd recommend breaking it into manageable chunks. You'll be able to parallelize that way too, so it's not a bad idea.

Java- using an InputStream as a File

I'm trying to generate a PDF document from an uploaded ".docx" file using JODConverter.
The call to the method that generates the PDF is something like this :
File inputFile = new File("document.doc");
File outputFile = new File("document.pdf");
// connect to an OpenOffice.org instance running on port 8100
OpenOfficeConnection connection = new SocketOpenOfficeConnection(8100);
connection.connect();
// convert
DocumentConverter converter = new OpenOfficeDocumentConverter(connection);
converter.convert(inputFile, outputFile);
// close the connection
connection.disconnect();
I'm using apache commons FileUpload to handle uploading the docx file, from which I can get an InputStream object. I'm aware that Java.io.File is just an abstract reference to a file in the system.
I want to avoid the disk write (saving the InputStream to disk) and the disk read (reading the saved file in JODConverter).
Is there any way I can get a File object refering to an input stream? just any other way to avoid disk IO will also do!
EDIT: I don't care if this will end up using a lot of system memory. The application is going to be hosted on a LAN with very little to zero number of parallel users.
File-based conversions are faster than stream-based ones (provided by StreamOpenOfficeDocumentConverter) but they require the OpenOffice.org service to be running locally and have the correct permissions to the files.
Try the doc to avoid disk writting:
convert(java.io.InputStream inputStream, DocumentFormat inputFormat, java.io.OutputStream outputStream, DocumentFormat outputFormat)
There is no way to do it and make the code solid. For one, the .convert() method only takes two Files as arguments.
So, this would mean you'd have to extend File, which is possible in theory, but very fragile, as you are required to delve into the library code, which can change at any time and make your extended class non functional.
(well, there is a way to avoid disk writes if you use a RAM-backed filesystem and read/write from that filesystem, of course)
Chances are that commons fileupload has written the upload to the filesystem anyhow.
Check if your FileItem is an instance of DiskFileItem. If this is the case the write implementation of DiskFileItem willl try to move the file to the file object you pass. You are not causing any extra disk io then since the write already happened.

Avoiding fragmentation when saving files to BlackBerry filesystem. Best practice?

In my application I need to save some file (a pdf) to the filesystem. My current method involves creating a directory for storing the files:
FileConnection fc = (FileConnection)Connector.open("file:///SDCard/BlackBerry/pdfs/");
if (!fc.exists())
fc.mkdir();
fc.close();
I then write to the directory with my file:
fc = (FileConnection)Connector.open("file:///SDCard/BlackBerry/pdfs/" + filename, Connector.READ_WRITE);
if (!fc.exists())
fc.create();
OutputStream outStream = fc.openOutputStream();
outStream.write(pdf);
outStream.close();
fc.close();
This all works fine, and my pdf arrives in my created directory. My question is: will I run into trouble with the fact that I have hard coded a file path as my save destination. With the BlackBerry API is it possible to retrieve a writeable folder which exists on all models/configurations?
You can query the system for the available roots using FileSystemRegistry.listRoots(). Note that it is not guaranteed that there will be an sdcard, or that it will be visible even if there is one (when in mass storage mode, for instance). I think that the only root guaranteed to be on all devices is internal storage ("file:///Store").
There's (a little) more information here.

Categories