Write Excel data directly to OutputStream (limit memory consumption) - java

I’m looking for a simple solution to output big Excel files. The input data comes from the database, and I’d like to write the Excel file directly to disk to keep memory consumption as low as possible. I had a look at things like Apache POI and jxls but found no way to solve my problem.
As additional information: I need to generate .xls files for pre-2007 Excel, not the new .xlsx XML format. I also know I could generate CSV files, but I’d prefer to generate plain Excel…
Any ideas?
I realize my question isn't so clear: I really want to be able to write the Excel file without having to keep the whole thing in memory.

The only way to do this efficiently is to use a character-based format such as CSV or XML (XLSX), because those can be written to the output line by line, so that on balance only one line needs to be in memory at any time. The binary XLS format must first be populated completely in memory before it can be written to the output, which of course hogs memory when there are many records.
I would recommend CSV, as it may be more efficient than XML, and it has the advantage that any decent database server can export it, so you don't need to program or include anything new in Java. I don't know which DB you're using, but if it were MySQL, for example, you could use SELECT ... INTO OUTFILE for this (LOAD DATA INFILE is the matching import statement).
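If the export does have to happen in Java, the "one line at a time" idea can be sketched roughly as follows. This is only a minimal illustration, not a full CSV library: `CsvRowStreamer` and its methods are hypothetical names, and the rows are assumed to arrive through an `Iterator` (for a database this would typically wrap a JDBC `ResultSet`, which is not shown here).

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical helper: writes rows one at a time, so only the current row is held in memory. */
final class CsvRowStreamer {

    /** Quotes a field when it contains a comma, a double quote, or a line break. */
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        String escaped = field.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    /** Writes a single row and terminates it with CRLF. */
    static void writeRow(Writer out, List<String> fields) throws IOException {
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                out.write(',');
            }
            out.write(escape(fields.get(i)));
        }
        out.write("\r\n");
    }

    /** Drains the iterator row by row and returns the number of rows written. */
    static long writeAll(Writer out, Iterator<List<String>> rows) throws IOException {
        long count = 0;
        while (rows.hasNext()) {
            writeRow(out, rows.next());
            count++;
        }
        out.flush();
        return count;
    }

    public static void main(String[] args) throws IOException {
        // A StringWriter stands in for the real target (file or HTTP response).
        StringWriter out = new StringWriter();
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("id", "name"),
                Arrays.asList("1", "Smith, John"));
        writeAll(out, rows.iterator());
        System.out.print(out);
    }
}
```

In a real export the `Writer` would wrap the target `OutputStream` (for example via `OutputStreamWriter` with an explicit charset), so nothing beyond the current row and the stream buffer stays in memory.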

No idea for generating a real XLS file. But you can directly write an HTML file, or a zip stream containing an OpenDocument spreadsheet (I guess MS Excel can read the latter format).

JExcelAPI is often recommended as a more memory efficient alternative to Apache POI.

Related

Optimize Apache POI .xls file append

Can someone please let me know if there is a memory-efficient way to append to .xls files? (The client is very insistent on an .xls file for the report, and I did all possible research, but in vain.) All I could find is that to append to an existing .xls, we first have to load the entire file into memory, append the data and then write it back. Is that the only way? I can afford to trade time for lower memory consumption.
I am afraid that is not possible using Apache POI, and I doubt that it is possible with other libraries. Even the Microsoft applications themselves always need to open the whole file to be able to work with it.
All of the Microsoft Office file formats have a complex internal structure similar to a file system, and the parts of that internal system may have relations to each other. So one cannot simply stream data into those files and append data as is possible with plain text files, CSV files, or single XML files, for example. One always needs to consider the validity of the complete file system and its relations, so the complete file system always needs to be known. And where should it be known, if not in memory?
The modern Microsoft Office file formats are Office Open XML. These are ZIP archives containing an internal file system with a directory structure of XML files and other files. So one can reduce the memory footprint by reading data parts from that ZIP file system directly instead of unzipping the whole ZIP file system into memory. This is what Apache POI tries to do with XSSF and SAX (Event API). But this is for reading only.
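The principle of reading one part straight out of the ZIP container with SAX can be sketched with the JDK alone. This is not how Apache POI implements its event API, only an illustration of the idea. `ZipSaxRowCounter` is a hypothetical name, and the sample archive contains a heavily simplified sheet part (no namespaces, no shared strings) stored under the conventional `xl/worksheets/sheet1.xml` path.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: stream one ZIP entry through a SAX parser without building a DOM. */
final class ZipSaxRowCounter {

    /** Counts the row elements in the named entry; rows are seen one event at a time. */
    static int countRows(InputStream zipData, String entryName) throws Exception {
        final int[] rows = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("row".equals(qName)) {
                    rows[0]++; // a real reader would hand the row to the caller here
                }
            }
        };
        try (ZipInputStream zip = new ZipInputStream(zipData)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals(entryName)) {
                    // The ZipInputStream reports end-of-stream at the end of the current entry.
                    SAXParserFactory.newInstance().newSAXParser().parse(new InputSource(zip), handler);
                    return rows[0];
                }
            }
        }
        throw new IllegalArgumentException("No such entry: " + entryName);
    }

    /** Builds a tiny in-memory archive with a simplified sheet part, for demonstration only. */
    static byte[] sampleZip(int rowCount) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("xl/worksheets/sheet1.xml"));
            StringBuilder xml = new StringBuilder("<worksheet><sheetData>");
            for (int i = 1; i <= rowCount; i++) {
                xml.append("<row r=\"").append(i).append("\"><c><v>").append(i).append("</v></c></row>");
            }
            xml.append("</sheetData></worksheet>");
            zip.write(xml.toString().getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] zip = sampleZip(1000);
        System.out.println(countRows(new ByteArrayInputStream(zip), "xl/worksheets/sheet1.xml"));
    }
}
```

Memory use depends on the parser buffers and on whatever the handler keeps, not on the number of rows in the part.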
For the writing approach, one could have parts of the data (single XML files) written to temporary files to keep them out of memory, and then put the complete ZIP file system together from those temporary files when all writing is complete. This is what SXSSF (Streaming Usermodel API) tries to do. But this is for writing only.
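The temporary-file pattern described here can be sketched with the JDK alone. This is not SXSSF and the result is not a valid .xlsx file (the content types, relationships and workbook parts are missing). It only shows rows going to a temporary file first and being copied into the ZIP container at the end. `TempFileSheetWriter` is a hypothetical name.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

/** Hypothetical sketch: buffer sheet rows in a temp file, then assemble the ZIP at the end. */
final class TempFileSheetWriter implements AutoCloseable {
    private final Path tempFile;
    private final BufferedWriter rows;
    private int rowCount;

    TempFileSheetWriter() throws IOException {
        tempFile = Files.createTempFile("sheet", ".xml");
        rows = Files.newBufferedWriter(tempFile, StandardCharsets.UTF_8);
        rows.write("<worksheet><sheetData>");
    }

    private static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Appends one row to the temp file; nothing about earlier rows stays in memory. */
    void appendRow(String... values) throws IOException {
        rowCount++;
        rows.write("<row r=\"" + rowCount + "\">");
        for (String value : values) {
            rows.write("<c t=\"inlineStr\"><is><t>" + escapeXml(value) + "</t></is></c>");
        }
        rows.write("</row>");
    }

    /** Finishes the part and copies it from the temp file into a ZIP entry. Closes the target. */
    void finishInto(OutputStream target, String entryName) throws IOException {
        rows.write("</sheetData></worksheet>");
        rows.close();
        try (ZipOutputStream zip = new ZipOutputStream(target)) {
            zip.putNextEntry(new ZipEntry(entryName));
            Files.copy(tempFile, zip); // streamed copy, the file is never fully loaded
            zip.closeEntry();
        }
    }

    @Override
    public void close() throws IOException {
        rows.close(); // no-op if finishInto already closed it
        Files.deleteIfExists(tempFile);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream target = new ByteArrayOutputStream();
        try (TempFileSheetWriter writer = new TempFileSheetWriter()) {
            for (int i = 0; i < 100_000; i++) {
                writer.appendRow("row " + i, "value");
            }
            writer.finishInto(target, "xl/worksheets/sheet1.xml");
        }
        System.out.println(target.size() + " bytes in archive");
    }
}
```

The trade-off is the one the answer names: the final file cannot start flowing to the client until the last row has been written to the temporary file.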
When it comes to appending data to an existing Microsoft Office file, none of the above is usable. As said already, one always needs to consider the validity of the complete file system and its relations, so the whole file system always needs to be accessible in order to append data parts to it and update the relationships. One could think about keeping all data parts (single XML files) and relationship parts in temporary files to keep them out of memory. But I don't know of any library (except maybe closed-source ones like Aspose) that does this, and I doubt it would be possible in a performant way. So you would pay in time for a lower memory footprint.
The older Microsoft Office file formats are binary file systems, but they also consist of a complex internal structure. The single parts are streams of binary records which may also have relations to each other. So the main problem is the same as with Office Open XML.
There is the Event API (HSSF only), which tries to read single record streams, similar to the event API for Office Open XML. But, of course, this is for reading only.
There is no streaming approach for writing HSSF up to now. The reason is that the old binary Excel worksheets only provide 65,536 rows and 256 columns, so the amount of data in one sheet cannot be that big, and a GB-sized *.xls file should not occur at all. You should not use Excel as a data exchange format for database data. This is not what a spreadsheet application is made for.
But even if one were to program a streaming approach for writing HSSF, this would not solve your problem, because there is still nothing for appending data to an existing *.xls file. And the problems for this are the same as with the Office Open XML file formats.

pagination of xlsx file to XSSFworkbook using apache POI

Right now in my code, I am reading an xlsx file into an XSSFWorkbook and then finally writing it into a database. But when the size of the xlsx file increases, it causes an OutOfMemoryError.
I can not increase the server size, or divide the xlsx file into pieces.
I tried loading workbook using file (instead of inputstream), but that didn't help either.
I am looking for a way to read 10k rows at a time (instead of the entire file at once) and iteratively write to the workbook and then to the database.
Is there a good way to do this with Apache POI?
POI contains something called an "eventmodel" which is designed exactly for this purpose. It's mentioned in the FAQ:
The SS eventmodel package is an API for reading Excel files without loading the whole spreadsheet into memory. It does require more knowledge on the part of the user, but reduces memory consumption by more than tenfold. It is based on the AWT event model in combination with SAX. If you need read-only access, this is the best way to do it.
However, you may want to double check first if the issue is somewhere else. Check out this item:
I think POI is using too much memory! What can I do?
This one comes up quite a lot, but often the reason isn't what you might initially think. So, the first thing to check is - what's the source of the problem? Your file? Your code? Your environment? Or Apache POI?
(If you're here, you probably think it's Apache POI. However, it often isn't! A moderate laptop, with a decent but not excessive heap size, from a standing start, can normally read or write a file with 100 columns and 100,000 rows in under a couple of seconds, including the time to start the JVM).
Apache POI ships with a few programs and a few example programs, which can be used to do some basic performance checks. For testing file generation, the class to use is in the examples package, SSPerformanceTest. Run SSPerformanceTest with arguments of the writing type (HSSF, XSSF or SXSSF), the number of rows, the number of columns, and whether the file should be saved. If you can't run that with 50,000 rows and 50 columns in HSSF and SXSSF in under 3 seconds, and XSSF in under 10 seconds (and ideally all 3 in less than that!), then the problem is with your environment.
Next, use the example program ToCSV to try reading a file in with HSSF or XSSF. Related is XLSX2CSV, which uses SAX parsing for .xlsx. Run this against both your problem file and a simple one of the same size generated by SSPerformanceTest. If this is slow, then there could be an Apache POI problem with how the file is being processed (POI makes some assumptions that might not always be right on all files). If these tests are fast, then any performance problems are in your code!
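Returning to the event-model approach: however the rows are delivered by an event-style reader, the "10k rows at a time" handling from the question can be kept separate from the parsing. The following is a minimal sketch with hypothetical names (`RowBatcher`, and a `sink` that in practice might run a JDBC batch insert). It does not use any POI API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: collects rows from a streaming reader and hands them on in fixed-size batches. */
final class RowBatcher<T> implements AutoCloseable {
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private List<T> buffer;

    RowBatcher(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
        this.buffer = new ArrayList<>(batchSize);
    }

    /** Called once per row by the reader; flushes automatically when the batch is full. */
    void add(T row) {
        buffer.add(row);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Hands the current batch to the sink and starts a fresh one. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(buffer);
        buffer = new ArrayList<>(batchSize);
    }

    /** Flushes the final, possibly partial, batch. */
    @Override
    public void close() {
        flush();
    }

    public static void main(String[] args) {
        // The sink only prints batch sizes here; a real one would write to the database.
        try (RowBatcher<String> batcher =
                     new RowBatcher<>(10_000, batch -> System.out.println("batch of " + batch.size()))) {
            for (int i = 0; i < 25_000; i++) {
                batcher.add("row " + i);
            }
        }
    }
}
```

At most one batch is held in memory at a time, so the heap requirement is bounded by the batch size rather than by the size of the file.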

Forward only Streaming Java API for Excel

I have a web application that supports download of large result sets (400K+ rows) in Excel format. The limitation I have with Apache POI is that I have to generate the entire Excel file before I can stream it, which is putting a lot of stress on the application servers.
Is there a way I can stream the Excel file partially, a few rows at a time? This is a forward-only operation with a single worksheet. I will not modify cells that have already been created.
I can do this in CSV, but formatting is a "must have" requirement here.
You may want to take a look at the Buffered-streaming SXSSF Howto, which outlines how you can write an XLSX stream by using mostly the same API as XSSF.
This works by creating a "window" into the worksheet, which moves with the cells that are currently written into. I can attest it works even when creating very large streams.
Note that IIRC this uses temporary storage for the unwritten stream and will only write once the workbook is complete.

Large Excel File - Upload and Import - Java

All,
I want to upload and import a large Excel file with more than one million records into my Java program.
I can easily import small files into my system using Apache POI, but when I start with large files the application throws an out-of-memory error.
I searched Google and found many threads on SO. I tried everything but could not get around this.
Can anybody give me a solution for my particular problem? Import time is not an issue for me, and right now I can live with performance issues as well.
I just want to import this data into my existing system without an OOM error.
I have a very good configuration on my system and Java has enough memory to use, so hardware is not an issue.
Thanks
You'll want to stream the data so that you don't need to store all the records in memory at once. POI does support streaming (see XSSF and SAX event API). As you read the data, ship it off to wherever you need to (database or wherever, you did not specify) -- with the streaming API you should not read all the data into memory before processing it.
You could also export the data to a CSV file, and then use a regular FileInputStream to read the file and process each record as it is read.
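The CSV variant can be sketched as below. `CsvLineProcessor` is a hypothetical name, and the parsing is deliberately naive (a plain split on commas, so quoted fields containing commas or line breaks are not handled). A real import would use a proper CSV parser.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;

/** Hypothetical sketch: process a CSV source record by record instead of loading it whole. */
final class CsvLineProcessor {

    /** Passes each non-empty line to the handler as soon as it is read; returns the record count. */
    static long process(Reader source, Consumer<String[]> handler) throws IOException {
        long count = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                // Naive split: assumes no quoted fields. The -1 limit keeps trailing empty fields.
                handler.accept(line.split(",", -1));
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for a reader over the real file.
        Reader source = new StringReader("1,alice\n2,bob\n");
        long n = process(source, fields -> System.out.println(fields[0] + " -> " + fields[1]));
        System.out.println(n + " records");
    }
}
```

For a file on disk, the `Reader` would typically be an `InputStreamReader` over the `FileInputStream` mentioned above, with the charset given explicitly.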

Saving different csv files as different sheets in a single excel workbook

Related to this question: how can I save many different CSV files into one Excel workbook, with one sheet per CSV? I would like to know how to do this programmatically in Java.
You'll need some form of library for accessing Excel from Java. A Google search turned this one up:
http://j-integra.intrinsyc.com/support/com/doc/excel_example.html
An alternative is to use the XML Excel format that came into being with Office 2003. You'll end up with an XML file, but you can open it in Excel and see the different sheets.
http://www.javaworld.com/javaworld/jw-07-2004/jw-0712-officeml.html
If you want open source, the POI library can be used to generate Excel files.
A nice CSV parser is Open CSV.
That should set the stage for what you are trying to do (basically, use the CSV parser to get the data, then write the data to an XLS file).
Take a look at the Aspose products. I've used them before when working with Excel, and they saved me a huge amount of headache and time. Excel has several quirks that can make importing and exporting spreadsheets painful.
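For the XML alternative mentioned above, a rough sketch using only the JDK might look like the following. It writes the Office 2003 XML Spreadsheet format with one `Worksheet` element per CSV source. `CsvToSpreadsheetMl` is a hypothetical name, every cell is written as a string, the CSV parsing is a naive comma split (a parser such as Open CSV would replace it), and sheet names are assumed to already satisfy Excel's naming rules.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: one worksheet per CSV source, written as Office 2003 XML Spreadsheet. */
final class CsvToSpreadsheetMl {

    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Streams every CSV into its own worksheet; the map key is used as the sheet name. */
    static void write(Map<String, Reader> csvBySheetName, Writer out) throws IOException {
        // No encoding is declared, so the Writer is assumed to encode as UTF-8.
        out.write("<?xml version=\"1.0\"?>\n");
        out.write("<?mso-application progid=\"Excel.Sheet\"?>\n");
        out.write("<Workbook xmlns=\"urn:schemas-microsoft-com:office:spreadsheet\""
                + " xmlns:ss=\"urn:schemas-microsoft-com:office:spreadsheet\">\n");
        for (Map.Entry<String, Reader> entry : csvBySheetName.entrySet()) {
            out.write("<Worksheet ss:Name=\"" + escapeXml(entry.getKey()) + "\"><Table>\n");
            BufferedReader reader = new BufferedReader(entry.getValue());
            String line;
            while ((line = reader.readLine()) != null) {
                out.write("<Row>");
                // Naive split: quoted fields with embedded commas are not handled.
                for (String field : line.split(",", -1)) {
                    out.write("<Cell><Data ss:Type=\"String\">" + escapeXml(field) + "</Data></Cell>");
                }
                out.write("</Row>\n");
            }
            out.write("</Table></Worksheet>\n");
        }
        out.write("</Workbook>\n");
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        // A LinkedHashMap keeps the sheets in insertion order.
        Map<String, Reader> input = new LinkedHashMap<>();
        input.put("customers", new StringReader("id,name\n1,Alice\n"));
        input.put("orders", new StringReader("id,total\n7,19.99\n"));
        StringWriter out = new StringWriter();
        write(input, out);
        System.out.print(out);
    }
}
```

Because each row is written as soon as it is read, this approach never holds a whole sheet in memory. The result is an XML file rather than a binary .xls file.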
Aspose.Cells
