I have a web application that supports downloading large result sets (400K+ rows) in Excel format. The limitation I have with Apache POI is that I have to generate the entire Excel file before I can start streaming it, and this puts a lot of stress on the application servers.
Is there a way I can stream the Excel file partially, a few rows at a time? This is a forward-only operation with a single worksheet, and I will not modify cells that have already been written.
I can do this in CSV but formatting is a "must have" requirement here.
You may want to take a look at the buffered-streaming SXSSF howto, which outlines how you can write an XLSX stream using mostly the same API as XSSF.
This works by keeping a sliding "window" into the worksheet, which moves along with the rows currently being written. I can attest that it works even when creating very large files.
Note that, IIRC, this uses temporary storage for the rows that have been flushed out of the window, and only assembles the final file once the workbook is complete.
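To make that concrete, here is a minimal sketch along the lines of that howto (file name, sheet name, and row counts are made up; the window size of 100 means only the most recent 100 rows stay in memory while older ones are flushed to a temporary file):

```java
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

public class StreamingExport {
    public static void main(String[] args) throws Exception {
        // Window of 100: only the most recent 100 rows are kept in memory;
        // older rows are flushed to a temporary file as you go.
        SXSSFWorkbook wb = new SXSSFWorkbook(100);
        try {
            Sheet sheet = wb.createSheet("results");
            for (int r = 0; r < 400_000; r++) {
                Row row = sheet.createRow(r);
                for (int c = 0; c < 10; c++) {
                    row.createCell(c).setCellValue("row " + r + " col " + c);
                }
            }
            try (FileOutputStream out = new FileOutputStream("results.xlsx")) {
                wb.write(out); // the final file is assembled only here
            }
        } finally {
            wb.dispose(); // delete the temporary files backing the flushed rows
        }
    }
}
```

Cell styles created through the workbook work the same way as in XSSF, so a formatting "must have" should not be a blocker with this approach.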
Related
Can someone please let me know if there is a memory-efficient way to append to .xls files? (The client is very insistent on an .xls file for the report, and I did all possible research, but in vain.) All I could find is that to append to an existing .xls, we first have to load the entire file into memory, append the data, and then write it back. Is that the only way? I can afford to give up time to optimize memory consumption.
I am afraid that is not possible using Apache POI, and I doubt it will be possible with other libraries either. Even the Microsoft applications themselves always need to open the whole file to be able to work with it.
All of the Microsoft Office file formats have a complex internal structure similar to a file system, and the parts of that internal system may have relations to each other. So one cannot simply stream data into those files and append data as is possible with plain text files, CSV files, or single XML files. The validity of the complete file system and its relations always needs to be considered, so the complete file system always needs to be known. And where should it be known, if not in memory?
The modern Microsoft Office file formats are Office Open XML. These are ZIP archives containing an internal file system with a directory structure of XML files and other files. So one can reduce the memory footprint by reading data parts from that ZIP file system directly instead of unzipping everything into memory. This is what Apache POI does with XSSF and SAX (the Event API). But this is for reading only.
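You can see that internal file system for yourself with nothing but the JDK, since an .xlsx file is just a ZIP archive (the file name here is hypothetical):

```java
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ListXlsxParts {
    public static void main(String[] args) throws Exception {
        // An .xlsx file is a ZIP archive; print its internal directory structure
        try (ZipFile zip = new ZipFile("report.xlsx")) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                System.out.println(entry.getName() + " (" + entry.getSize() + " bytes)");
            }
        }
    }
}
```

Among the entries you should see parts like xl/workbook.xml, xl/worksheets/sheet1.xml, and the xl/_rels/ relationship files that tie the parts together.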
For writing, one can have parts of the data (single XML files) written to temporary files to keep them out of memory, then assemble the complete ZIP file system from those temporary files once all writing is complete. This is what SXSSF (the Streaming Usermodel API) does. But this is for writing only.
When it comes to appending data to an existing Microsoft Office file, none of the above is usable. Because, as said already, the validity of the complete file system and its relations always needs to be considered, the whole file system must be accessible in order to append data parts to it and update the relationships. One could think about keeping all data parts (single XML files) and relationship parts in temporary files, away from memory. But I don't know of any library (maybe a closed-source one like Aspose) that does this, and I doubt it would be possible in a performant way. So you would pay in time for a lower memory footprint.
The older Microsoft Office file formats are binary file systems that also have a complex internal structure. The single parts are streams of binary records, which may also have relations to each other. So the main problem is the same as with Office Open XML.
There is an Event API (HSSF only) that reads single record streams, similar to the event API for Office Open XML. But, of course, this is for reading only.
There is no streaming approach for writing HSSF up to now. One reason is that the old binary Excel worksheets only provide 65,536 rows and 256 columns, so the amount of data in one sheet cannot be that big, and a GB-sized *.xls file should not occur at all. You should not use Excel as a data exchange format for database data; that is not what a spreadsheet application is made for.
But even if someone did program a streaming approach for writing HSSF, it would not solve your problem, because there is still nothing for appending data to an existing *.xls file. The problems there are the same as with the Office Open XML file formats.
Right now my code reads an xlsx file into an XSSFWorkbook and then writes the data to a database. But when the size of the xlsx file grows, it causes an OutOfMemoryError.
I cannot increase the server size or divide the xlsx file into pieces.
I tried loading the workbook from a File (instead of an InputStream), but that didn't help either.
I am looking for a way to read 10k rows at a time (instead of the entire file at once) and iteratively write to the workbook and then to the database.
Is there a good way to do this with Apache POI?
POI contains something called an "eventmodel" which is designed exactly for this purpose. It's mentioned in the FAQ:
The SS eventmodel package is an API for reading Excel files without loading the whole spreadsheet into memory. It does require more knowledge on the part of the user, but reduces memory consumption by more than tenfold. It is based on the AWT event model in combination with SAX. If you need read-only access, this is the best way to do it.
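As a rough sketch of that event-based approach for the 10k-rows-at-a-time requirement (assuming a recent POI version, 4.1 or later; the file name, the row buffering, and flushBatchToDatabase are placeholders for your own JDBC code):

```java
import java.io.InputStream;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.util.XMLHelper;
import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler;
import org.apache.poi.xssf.model.StylesTable;
import org.apache.poi.xssf.usermodel.XSSFComment;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class BatchedXlsxReader {
    public static void main(String[] args) throws Exception {
        try (OPCPackage pkg = OPCPackage.open("big.xlsx")) {
            XSSFReader reader = new XSSFReader(pkg);
            StylesTable styles = reader.getStylesTable();
            ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(pkg);

            // Receives SAX events row by row; only the current row is buffered
            XSSFSheetXMLHandler.SheetContentsHandler batcher =
                    new XSSFSheetXMLHandler.SheetContentsHandler() {
                private int rowsInBatch = 0;

                @Override
                public void startRow(int rowNum) { /* start a fresh row buffer */ }

                @Override
                public void cell(String ref, String value, XSSFComment comment) {
                    // collect the formatted cell value into the current row buffer
                }

                @Override
                public void endRow(int rowNum) {
                    if (++rowsInBatch == 10_000) {
                        flushBatchToDatabase(); // placeholder for your JDBC batch insert
                        rowsInBatch = 0;
                    }
                }
            };

            XMLReader parser = XMLHelper.newXMLReader();
            parser.setContentHandler(new XSSFSheetXMLHandler(styles, strings, batcher, false));
            try (InputStream sheet = reader.getSheetsData().next()) { // first sheet only
                parser.parse(new InputSource(sheet));
            }
            flushBatchToDatabase(); // flush the final partial batch
        }
    }

    static void flushBatchToDatabase() { /* placeholder: JDBC batch insert goes here */ }
}
```

Memory use is then dominated by the shared-strings table and your batch buffer rather than the whole workbook object model, which is why the footprint stays roughly flat as the file grows.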
However, you may want to double check first if the issue is somewhere else. Check out this item:
I think POI is using too much memory! What can I do?
This one comes up quite a lot, but often the reason isn't what you might initially think. So, the first thing to check is - what's the source of the problem? Your file? Your code? Your environment? Or Apache POI?
(If you're here, you probably think it's Apache POI. However, it often isn't! A moderate laptop, with a decent but not excessive heap size, from a standing start, can normally read or write a file with 100 columns and 100,000 rows in under a couple of seconds, including the time to start the JVM).
Apache POI ships with a few programs and example programs which can be used to do some basic performance checks. For testing file generation, the class to use is in the examples package: SSPerformanceTest. Run SSPerformanceTest with arguments for the writing type (HSSF, XSSF or SXSSF), the number of rows, the number of columns, and whether the file should be saved. If you can't run that with 50,000 rows and 50 columns in HSSF and SXSSF in under 3 seconds, and XSSF in under 10 seconds (and ideally all 3 in less than that!), then the problem is with your environment.
Next, use the example program ToCSV to try reading a file in with HSSF or XSSF. Related is XLSX2CSV, which uses SAX parsing for .xlsx. Run this against both your problem file and a simple file of the same size generated by SSPerformanceTest. If this is slow, then there could be an Apache POI problem with how the file is being processed (POI makes some assumptions that might not always be right for all files). If these tests are fast, then any performance problems are in your code!
I am developing an internal system that is intended to work very much like Google Docs. The main piece I am implementing mimics their web-based Spreadsheet implementation. For multiple reasons I am not able to use Google Docs or ZK, which has a very robust Spreadsheet API. I chose POI 3.7 as a starting point for my Excel spreadsheet processing.
Currently, when a user uploads an Excel spreadsheet, I take the file byte[] and store it in our db as a blob. When a user wants to view the spreadsheet, I pull out the byte[], build the Workbook, and push it to the client UI for editing. The pushing to the UI isn't my concern. When a user makes edits to the spreadsheet, I push the edits to the server, store them on a stack, and only apply the updates when the user presses the "save" button. On save, I pull the workbook back out of the database, make the changes, and push the Workbook back to the db. That way, I don't keep it in memory. It's no surprise that all of this is pretty fast, except when multiple users start doing it at once; exploding full Workbooks eats memory, as described in other posts here.
A user will only update one tab at a time, so why should I need to open the entire workbook? When a user initially uploads an Excel spreadsheet, can I pull out each Sheet, convert each to a byte[], and save each as an individual "worksheet" db record? The POI Sheet has a protected #write(Stream) method, but I would not like to get into the business of re-compiling POI. I also would not like to explode every cell into a new db entry. Would you do this differently in the first place?
Backend is java/spring/jdbc. For internal reasons, these are the technologies I'm stuck using.
Storing big binary blobs in the database is in itself not a good thing if performance is important. You would be much better off storing the workbooks on the disk.
I can only give you half an answer to your question: you can read xlsx (not xls) files one sheet at a time using XSSFReader (http://poi.apache.org/apidocs/index.html?org/apache/poi/xssf/eventusermodel/XSSFReader.html), and you can use a SAXParser to avoid holding each full sheet in memory. I don't think there is any way of saving a sheet without creating a Sheet object.
Warning, hack: one quick hack could be to use reflection to call the protected method. There is of course no guarantee that this will work in future versions of POI.
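Purely to illustrate that hack (this pokes at POI internals: the protected write method on XSSFSheet is not public API, and its name or signature may change between versions):

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.lang.reflect.Method;
import org.apache.poi.xssf.usermodel.XSSFSheet;

public class SheetDumper {
    // Calls XSSFSheet's protected write(OutputStream) via reflection to grab the
    // sheet's XML part as bytes. Fragile: relies on POI internals, use at your own risk.
    static byte[] sheetToBytes(XSSFSheet sheet) throws Exception {
        Method write = XSSFSheet.class.getDeclaredMethod("write", OutputStream.class);
        write.setAccessible(true);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        write.invoke(sheet, out);
        return out.toByteArray();
    }
}
```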
With Excel files, some things are stored at the sheet level, but other bits are stored at the workbook level. As your user edits a sheet, while most of their changes will be on the sheet part, some bits will need to touch the workbook level entities, and for that you'll need the whole file.
You might want to take a look at how SharePoint does its collaborative editing, which allows several people using Excel to work on the same file, much like Google Docs. All the SharePoint protocol documents are publicly available. There was an event on those docs very recently for which videos and presentations should be online soon; keep an eye on the Office interop blog for when they appear. In the SharePoint docs you should find the details of how Microsoft chunks up an Excel file for collaborative editing, and there's something to be said for you doing the same!
I would consider saving the sheets as separate XML files in the database. If you store additional (meta)data about which sheets belong together, it shouldn't be too much hassle keeping them together. The reason for using XML is that from Excel 2003 on, spreadsheets can be saved as XML and can therefore easily be created by code as well.
If at one point you seem to be hitting too many walls with Apache POI, you could look into the OpenOffice API as well.
I'm looking for a simple solution to output big Excel files. Input data comes from the database, and I'd like to write the Excel file directly to disk to keep memory consumption as low as possible. I had a look at things like Apache POI and jXLS but found no way to solve my problem.
As additional information, I need to generate .xls files for pre-2007 Excel, not the new .xlsx XML format. I also know I could generate CSV files, but I'd prefer to generate plain Excel...
Any ideas?
I realize my question isn't so clear: I really want to be able to write the Excel file without having to keep the whole thing in memory.
The only way to do this efficiently is to use a character-based format such as CSV or XML (which is what XLSX is built on), because those can be written to the output line by line, so that you effectively hold only one line at a time in memory. The binary .xls format must first be populated completely in memory before it can be written to the output, and that is of course memory-hogging with a large number of records.
I would recommend CSV for this, as it may be more efficient than XML, plus you have the advantage that any decent database server has export capabilities for it, so you don't need to program/include anything new in Java. I don't know which DB you're using, but if it were for example MySQL, you could use SELECT ... INTO OUTFILE for this.
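If you do roll it yourself in Java, a minimal sketch could look like this (connection URL, table, and column names are hypothetical; the fetch size is only a driver hint to stream rows rather than buffer the whole result set):

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class CsvExport {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "pass");
             PreparedStatement ps = con.prepareStatement("SELECT id, name, amount FROM results")) {
            ps.setFetchSize(1000); // hint: stream rows instead of buffering the whole result
            try (ResultSet rs = ps.executeQuery();
                 PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("results.csv")))) {
                out.println("id,name,amount"); // header row
                while (rs.next()) {
                    // quote the text column and double any embedded quotes
                    String name = "\"" + rs.getString("name").replace("\"", "\"\"") + "\"";
                    out.println(rs.getLong("id") + "," + name + "," + rs.getBigDecimal("amount"));
                }
            }
        }
    }
}
```

Only one row is ever held in memory at a time, so the export scales to arbitrarily large result sets.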
No idea for generating a real .xls file. But you could directly write an HTML file, or a ZIP stream containing an OpenDocument spreadsheet (I believe MS Excel can read the latter format).
JExcelAPI is often recommended as a more memory-efficient alternative to Apache POI.
Related to this question: how do you save many different CSV files into one Excel workbook, with one sheet per CSV? I would like to know how to do this programmatically in Java.
You'll need some form of library for accessing Excel from Java. A Google search turned this one up:
http://j-integra.intrinsyc.com/support/com/doc/excel_example.html
An alternative is to use the XML Excel format that came into being with Office 2003. You'll end up with an XML file, but you can open it in Excel and see the different sheets.
http://www.javaworld.com/javaworld/jw-07-2004/jw-0712-officeml.html
If you want open source, the POI library can be used to generate Excel files.
A nice CSV parser is OpenCSV.
That should set the stage for what you are trying to do (basically, use the CSV parser to get the data, then write the data to an XLS file).
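Roughly, putting those two together might look like this (file names are made up, and this assumes OpenCSV 5.x on the classpath; each CSV becomes one sheet of a single .xls workbook):

```java
import java.io.FileOutputStream;
import java.io.FileReader;
import java.util.List;
import com.opencsv.CSVReader;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;

public class CsvToXls {
    public static void main(String[] args) throws Exception {
        String[] csvFiles = {"first.csv", "second.csv"}; // hypothetical input files
        try (Workbook wb = new HSSFWorkbook()) {
            for (String csvFile : csvFiles) {
                // one sheet per CSV, named after the file
                Sheet sheet = wb.createSheet(csvFile.replace(".csv", ""));
                try (CSVReader reader = new CSVReader(new FileReader(csvFile))) {
                    List<String[]> rows = reader.readAll();
                    for (int r = 0; r < rows.size(); r++) {
                        Row row = sheet.createRow(r);
                        String[] cells = rows.get(r);
                        for (int c = 0; c < cells.length; c++) {
                            row.createCell(c).setCellValue(cells[c]);
                        }
                    }
                }
            }
            try (FileOutputStream out = new FileOutputStream("combined.xls")) {
                wb.write(out);
            }
        }
    }
}
```

Note that readAll() loads each CSV fully into memory, which is fine here since .xls sheets are capped at 65,536 rows anyway.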
Take a look at the Aspose products, I've used them before when working with Excel and they saved me a huge amount of headache and time. Excel has several quirks that can make importing and exporting spreadsheets painful.
Aspose.Cells