Reading zip files stored in GAE Blobstore - java

I have followed the sample code below to upload a zip file to the Blobstore. I am able to upload the zip file, but I have some concerns about reading it.
Sample Code http://code.google.com/appengine/docs/python/blobstore/overview.html#Complete_Sample_App
My zip file contains 6 CSV files; my system should read those files and import the values into the datastore. However, I am aware that there is a restriction on reading the file: reads must be less than 1MB.
Can anyone suggest how I can go about reading the zip file and processing the CSV files? What will happen if the data saved in the blobstore is more than 1MB?
Hope to hear from you. Thanks in advance.

Individual calls to the blobstore API must be less than 1MB, but you can read as much data as you want by making multiple calls. See this blog post for an example of using BlobReader to read the contents of a zip file from the blobstore; it's written in Python, but the Java SDK offers an equivalent (BlobstoreInputStream), and the same technique applies.
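For concreteness, here is a minimal Java sketch of that technique. BlobstoreInputStream fetches the blob in chunks behind the scenes, so the per-call limit never caps the total blob size; the blobKey parameter and the persistRow helper are assumptions for illustration:

```java
import com.google.appengine.api.blobstore.BlobKey;
import com.google.appengine.api.blobstore.BlobstoreInputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ZipImport {
    // blobKey is assumed to come from your upload handler.
    public void importCsvs(BlobKey blobKey) throws IOException {
        // BlobstoreInputStream issues multiple fetches under the hood,
        // so blobs larger than 1MB are fine.
        ZipInputStream zip = new ZipInputStream(new BlobstoreInputStream(blobKey));
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            if (!entry.getName().endsWith(".csv")) continue;
            // Do not close this reader: that would close the underlying zip stream.
            // Reads stop automatically at the end of the current entry.
            BufferedReader reader = new BufferedReader(new InputStreamReader(zip, "UTF-8"));
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",");
                // persistRow(fields); // hypothetical: write the values to the datastore
            }
        }
        zip.close();
    }
}
```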

Related

Optimize Apache POI .xls file append

Can someone please let me know if there is a memory-efficient way to append to .xls files? (The client is very insistent on .xls files for the report, and I did all possible research, but in vain.) All I could find is that to append to an existing .xls, we first have to load the entire file into memory, append the data, and then write it back. Is that the only way? I can afford to give up time to optimize memory consumption.
I am afraid that is not possible using Apache POI. And I doubt it will be possible with other libraries. Even Microsoft's own applications always need to open the whole file to be able to work with it.
All of the Microsoft Office file formats have a complex internal structure similar to a file system. And the parts of that internal system may have relations to each other. So one cannot simply stream data into those files and append data, as is possible with plain text files, CSV files, or single XML files, for example. One always needs to consider the validity of the complete file system and its relations. So the complete file system always needs to be known. And where should it be known if not in memory?
The modern Microsoft Office file formats are Office Open XML. These are ZIP archives containing an internal file system with a directory structure of XML files and other files. So one can reduce the memory footprint by reading data parts directly from that ZIP file system instead of unzipping it and reading all the data into memory. This is what Apache POI attempts with XSSF and SAX (the Event API). But this is for reading only.
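As a rough illustration of that reading approach (a sketch only: the handler is deliberately minimal and just prints cell references; real code would also resolve shared strings):

```java
import java.io.InputStream;
import java.util.Iterator;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackageAccess;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class StreamingXlsxRead {
    public static void main(String[] args) throws Exception {
        // Open the ZIP file system read-only; sheet XML is streamed, not loaded whole.
        try (OPCPackage pkg = OPCPackage.open("big.xlsx", PackageAccess.READ)) {
            XSSFReader reader = new XSSFReader(pkg);
            XMLReader parser = XMLReaderFactory.createXMLReader();
            parser.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String name, Attributes attrs) {
                    if ("c".equals(local)) { // a cell element in the sheet XML
                        System.out.println("cell " + attrs.getValue("r"));
                    }
                }
            });
            Iterator<InputStream> sheets = reader.getSheetsData();
            while (sheets.hasNext()) {
                try (InputStream sheet = sheets.next()) {
                    parser.parse(new InputSource(sheet));
                }
            }
        }
    }
}
```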
For writing, one could have parts of the data (single XML files) written to temporary files to keep them out of memory, then assemble the complete ZIP file system from those temporary files once all writing is complete. This is what SXSSF (the Streaming Usermodel API) does. But this is for writing only.
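A minimal sketch of the SXSSF writing approach; the file name and row contents are placeholders:

```java
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

public class StreamingXlsxWrite {
    public static void main(String[] args) throws Exception {
        // Keep only 100 rows in memory; earlier rows are flushed to temp files.
        SXSSFWorkbook wb = new SXSSFWorkbook(100);
        Sheet sheet = wb.createSheet("data");
        for (int r = 0; r < 1_000_000; r++) {
            Row row = sheet.createRow(r);
            row.createCell(0).setCellValue(r);
            row.createCell(1).setCellValue("value-" + r);
        }
        try (FileOutputStream out = new FileOutputStream("big.xlsx")) {
            wb.write(out); // assembles the final ZIP from the temp files
        }
        wb.dispose(); // delete the temporary files backing the flushed rows
    }
}
```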
When it comes to appending data to an existing Microsoft Office file, none of the above is usable. Because, as said already, one always needs to consider the validity of the complete file system and its relations, the whole file system always needs to be accessible in order to append data parts to it and update the relationships. One could think about keeping all data parts (single XML files) and relationship parts in temporary files, out of memory. But I don't know of any library (maybe closed-source ones like Aspose) that does this. And I doubt it would be possible in a performant way. So you would pay time for a lower memory footprint.
The older Microsoft Office file formats are binary file systems, but they also have a complex internal structure. The single parts are streams of binary records which may also have relations to each other. So the main problem is the same as with Office Open XML.
There is an Event API (HSSF only) which reads single record streams, similar to the event API for Office Open XML. But, of course, this is for reading only.
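A sketch of that HSSF Event API; it registers a listener and just prints numeric cells as their records stream past, without ever building a workbook in memory:

```java
import java.io.FileInputStream;
import org.apache.poi.hssf.eventusermodel.HSSFEventFactory;
import org.apache.poi.hssf.eventusermodel.HSSFListener;
import org.apache.poi.hssf.eventusermodel.HSSFRequest;
import org.apache.poi.hssf.record.NumberRecord;
import org.apache.poi.hssf.record.Record;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

public class StreamingXlsRead {
    public static void main(String[] args) throws Exception {
        try (POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream("old.xls"))) {
            HSSFRequest request = new HSSFRequest();
            // React to each record as it streams past.
            request.addListenerForAllRecords(new HSSFListener() {
                public void processRecord(Record record) {
                    if (record instanceof NumberRecord) {
                        NumberRecord n = (NumberRecord) record;
                        System.out.println("row " + n.getRow() + " col " + n.getColumn()
                                + " = " + n.getValue());
                    }
                }
            });
            new HSSFEventFactory().processWorkbookEvents(request, fs);
        }
    }
}
```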
There is no streaming approach for writing HSSF up to now. And the reason is that the old binary Excel worksheets only provide 65,536 rows and 256 columns, so the amount of data in one sheet cannot be that big, and a GB-sized *.xls file should not occur at all. You should not use Excel as a data exchange format for database data; that is not what a spreadsheet application is made for.
But even if one were to program a streaming approach for writing HSSF, it would not solve your problem, because there is still nothing for appending data to an existing *.xls file. The problems there are the same as with the Office Open XML file formats.

mapreduce in java - gzip input files

I'm using Java, and I'm trying to write a MapReduce job that will receive as input a folder containing multiple .gz files.
I've been looking all over, but all the tutorials I've found explain how to process a simple text file; I haven't found anything that solves my problem.
I've asked around at my workplace, but only got references to Scala, which I'm not familiar with.
Any help would be appreciated.
Hadoop checks the file extension to detect compressed files. The compression types supported by Hadoop include gzip, bzip2, and LZO. You do not need to take any additional action to extract files compressed with these codecs; Hadoop handles it for you.
So all you have to do is write the logic as you would for a text file, and pass in the directory which contains the .gz files as input.
But the issue with gzip files is that they are not splittable. Imagine you have gzip files of 5GB each: each mapper will then process the whole 5GB file instead of working with the default block size.
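To make the first point concrete, here is a sketch of a driver and mapper. Apart from passing the .gz folder as the input path, nothing gzip-specific appears, because the codec is chosen from the file extension before the record reader sees the bytes. Class and argument names are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GzJob {
    public static class LineMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws java.io.IOException, InterruptedException {
            // `value` is already a decompressed line: the gzip codec was
            // applied transparently based on the .gz extension.
            ctx.write(value, ONE);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "gz input");
        job.setJarByClass(GzJob.class);
        job.setMapperClass(LineMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));  // the folder of .gz files
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```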

How to extract rar file (AppEngineFile) in Google App Engine

My application requires extracting rar files in Google App Engine. I can extract rar files using this library, but it only supports java.io.File and does not support AppEngineFile. I cannot find any solution to this problem. Does anyone have any ideas?
Thanks in advance
The junrar library says it takes an InputStream, not a file.
You may want to try uploading your rar files to the Blobstore instead of reading from the filesystem. Then you can use BlobstoreInputStream to feed the data into junrar.
Note that since you can't write to the filesystem, you'll need to store the unpacked data back into the blobstore or datastore.
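A sketch of how that could look, assuming a junrar version whose Archive constructor accepts an InputStream (as the answer above suggests) and a hypothetical saveToDatastore helper:

```java
import com.github.junrar.Archive;
import com.github.junrar.rarfile.FileHeader;
import com.google.appengine.api.blobstore.BlobKey;
import com.google.appengine.api.blobstore.BlobstoreInputStream;
import java.io.ByteArrayOutputStream;

public class RarUnpack {
    // blobKey is assumed to reference the uploaded .rar blob.
    public void unpack(BlobKey blobKey) throws Exception {
        Archive archive = new Archive(new BlobstoreInputStream(blobKey));
        FileHeader header;
        while ((header = archive.nextFileHeader()) != null) {
            if (header.isDirectory()) continue;
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            archive.extractFile(header, buf);
            // No writable filesystem on GAE: store buf.toByteArray() back
            // into the blobstore or datastore instead.
            // saveToDatastore(header.getFileNameString(), buf.toByteArray()); // hypothetical
        }
        archive.close();
    }
}
```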

Dynamically generating files and providing those files for download in Google App engine

I am creating a web app in Google App Engine using Java which dynamically generates an HTML file. The requirement is that if the HTML file size exceeds a certain limit (say 3 MB), it should be split into two files which are zipped together, and that zip file should be sent back as the response.
I would like help on how and where to create those temporary HTML files and then zip them in Google App Engine, since I guess GAE doesn't allow writing to the filesystem.
Please help!!!
You can use the blobstore like a filesystem. Experimentally, they've even added access via the File API!
https://developers.google.com/appengine/docs/java/blobstore/overview#Writing_Files_to_the_Blobstore
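A sketch of that File-API route, writing the zipped HTML parts straight into the blobstore; the API is experimental, so treat the exact calls as an assumption based on the linked docs:

```java
import com.google.appengine.api.blobstore.BlobKey;
import com.google.appengine.api.files.AppEngineFile;
import com.google.appengine.api.files.FileService;
import com.google.appengine.api.files.FileServiceFactory;
import com.google.appengine.api.files.FileWriteChannel;
import java.io.OutputStream;
import java.nio.channels.Channels;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipToBlobstore {
    public BlobKey writeZip(byte[] htmlPart1, byte[] htmlPart2) throws Exception {
        FileService fileService = FileServiceFactory.getFileService();
        AppEngineFile file = fileService.createNewBlobFile("application/zip", "pages.zip");
        // "true" locks the file so this request can finalize it.
        FileWriteChannel channel = fileService.openWriteChannel(file, true);
        OutputStream raw = Channels.newOutputStream(channel);
        ZipOutputStream zip = new ZipOutputStream(raw);
        zip.putNextEntry(new ZipEntry("page-part1.html"));
        zip.write(htmlPart1);
        zip.closeEntry();
        zip.putNextEntry(new ZipEntry("page-part2.html"));
        zip.write(htmlPart2);
        zip.closeEntry();
        zip.finish();           // complete the zip structure without closing the channel
        channel.closeFinally(); // finalize the blob
        return fileService.getBlobKey(file); // serve this key back as the response
    }
}
```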
You could also use the Google Cloud Storage. The advantage of this one is that once the file is produced, you can easily write scripts to manipulate the files through gsutil.

Reading the content of the uploaded file in Struts2

How can I read the content of an uploaded file in the execute() method of the Action class? I am able to upload the file to the server but do not know how to read the content of that file.
Do we have to first save it on the server? Or can we read it directly?
Option 1: Create a servlet
I recommend you use Apache Commons FileUpload. This link has examples of how to process the uploaded file (writing it to disk, or reading it in memory if the file is small enough) using FileItem. Another relevant example can be found here.
Option 2: Use the Struts s:file tag
As @BalusC mentioned in the comments below, Struts has built-in file upload handling via the s:file tag library; a tutorial on using it is provided here. Essentially, the file gets uploaded to a temporary directory. However, you can override that by setting a value for the struts.multipart.saveDir property in the default.properties file. This link also mentions using Apache FileUtils to process the uploaded file afterwards, which, by the way, is a very handy library for any file I/O work.
This tutorial explains very clearly and in detail how to do the upload: http://java.dzone.com/articles/struts2-tutorial-part-67
Here's the standard way that Struts2 provides, with an example:
http://struts.apache.org/2.0.14/docs/file-upload.html
It's quite simple and elegant (no need to mess with servletRequest.getRealPath("/") as in the other example linked by hoss).
By using the <s:file> tag (and the appropriate interceptor), Struts2 does all the dirty work and gives you the (temporary) uploaded file as a File field in the action; you can open it, move it, or whatever, as you would with any file.
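For example, a minimal sketch of such an action; the field names follow the fileUpload interceptor's naming convention for <s:file name="upload"/>, and the process(line) call is a hypothetical placeholder:

```java
import com.opensymphony.xwork2.ActionSupport;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;

public class UploadAction extends ActionSupport {
    // The fileUpload interceptor populates these three fields by naming
    // convention: the field, its content type, and its original file name.
    private File upload;
    private String uploadContentType;
    private String uploadFileName;

    @Override
    public String execute() throws Exception {
        // "upload" points at the temporary copy in struts.multipart.saveDir;
        // it can be read directly, with no need to save it anywhere first.
        BufferedReader reader = new BufferedReader(new FileReader(upload));
        String line;
        while ((line = reader.readLine()) != null) {
            // process(line); // hypothetical: handle each line of the upload
        }
        reader.close();
        return SUCCESS;
    }

    public void setUpload(File upload) { this.upload = upload; }
    public void setUploadContentType(String s) { this.uploadContentType = s; }
    public void setUploadFileName(String s) { this.uploadFileName = s; }
}
```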
