Optimize Apache POI .xls file append - java

Can someone please let me know if there is a memory-efficient way to append to .xls files? (The client is very insistent on an .xls file for the report, and I did all possible research but in vain.) All I could find is that to append to an existing .xls, we first have to load the entire file into memory, append the data and then write it back. Is that the only way? I can afford to give up time to optimize memory consumption.

I am afraid that is not possible using Apache POI, and I doubt it is possible with other libraries either. Even the Microsoft applications themselves always need to open the whole file to be able to work with it.
All of the Microsoft Office file formats have a complex internal structure similar to a file system, and the parts of that internal system may have relations to each other. So one cannot simply stream data into those files and append to them the way one can with plain text files, CSV files or single XML files, for example. One always needs to consider the validity of the complete file system and its relations, so the complete file system always needs to be known. And where should it be known if not in memory?
The modern Microsoft Office file formats are Office Open XML. These are ZIP archives containing an internal file system: a directory structure holding XML files and other files. So one can reduce the memory footprint by reading data parts from that ZIP file system directly instead of unzipping the whole archive and reading all data into memory. This is what Apache POI does with XSSF and SAX (Event API). But this is for reading only.
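To make the "internal file system" point concrete, here is a minimal sketch that does not use POI at all: it opens an .xlsx file with java.util.zip, lists the internal parts and reads one worksheet part directly from the archive. The file name report.xlsx and the entry path are illustrative assumptions only.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class XlsxAsZip {
    public static void main(String[] args) throws Exception {
        // An .xlsx file is a plain ZIP archive; list its internal "file system".
        try (ZipFile zip = new ZipFile("report.xlsx")) {   // hypothetical file name
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                System.out.println(entries.nextElement().getName());
            }
            // Read one part directly, e.g. the first worksheet's XML,
            // without unzipping the whole archive into memory.
            ZipEntry sheet = zip.getEntry("xl/worksheets/sheet1.xml");
            if (sheet != null) {
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(zip.getInputStream(sheet), StandardCharsets.UTF_8))) {
                    System.out.println(reader.readLine()); // first line of the sheet XML
                }
            }
        }
    }
}
```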
For writing, one can have parts of the data (single XML files) written to temporary files to keep them out of memory, and then assemble the complete ZIP file system from those temporary files once all writing is complete. This is what SXSSF (Streaming Usermodel API) does. But this is for writing only.
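For completeness, a minimal SXSSF writing sketch, assuming a reasonably recent POI version; the row-access window of 100 rows, the row count and the output file name are arbitrary placeholder choices.

```java
import java.io.FileOutputStream;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

public class SxssfWriteSketch {
    public static void main(String[] args) throws Exception {
        // Keep only 100 rows in memory; older rows are flushed to temporary files.
        try (SXSSFWorkbook workbook = new SXSSFWorkbook(100)) {
            Sheet sheet = workbook.createSheet("data");
            for (int r = 0; r < 1_000_000; r++) {
                Row row = sheet.createRow(r);
                Cell cell = row.createCell(0);
                cell.setCellValue("row " + r);
            }
            try (FileOutputStream out = new FileOutputStream("big.xlsx")) {
                workbook.write(out);
            }
            workbook.dispose(); // delete the temporary files backing the flushed rows
        }
    }
}
```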
When it comes to appending data to an existing Microsoft Office file, none of the above is usable. Because, as said already, one always needs to consider the validity of the complete file system and its relations, the whole file system always needs to be accessible in order to append data parts to it and update the relationships. One could think about keeping all data parts (single XML files) and relationship parts in temporary files to keep them out of memory. But I don't know of any library (maybe closed-source ones like Aspose) that does this, and I doubt it would be possible in a performant way. So you would pay time for a lower memory footprint.
The older Microsoft Office file formats are binary file systems, but they also have a complex internal structure. The single parts are streams of binary records which may also have relations to each other. So the main problem is the same as with Office Open XML.
There is the Event API (HSSF only), which reads the single record streams similar to the event API for Office Open XML. But, of course, this is for reading only.
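As an illustration of that record-stream reading, a minimal HSSF Event API sketch that just prints the numeric cell records of a hypothetical legacy.xls, assuming a reasonably recent POI version:

```java
import java.io.FileInputStream;

import org.apache.poi.hssf.eventusermodel.HSSFEventFactory;
import org.apache.poi.hssf.eventusermodel.HSSFListener;
import org.apache.poi.hssf.eventusermodel.HSSFRequest;
import org.apache.poi.hssf.record.NumberRecord;
import org.apache.poi.hssf.record.Record;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

public class HssfEventReadSketch {
    public static void main(String[] args) throws Exception {
        try (FileInputStream in = new FileInputStream("legacy.xls");   // placeholder file name
             POIFSFileSystem fs = new POIFSFileSystem(in)) {
            HSSFRequest request = new HSSFRequest();
            // Listen to every record in the workbook stream; only act on NUMBER records here.
            request.addListenerForAllRecords(new HSSFListener() {
                @Override
                public void processRecord(Record record) {
                    if (record instanceof NumberRecord) {
                        NumberRecord nr = (NumberRecord) record;
                        System.out.println("row " + nr.getRow() + ", col " + nr.getColumn()
                                + " = " + nr.getValue());
                    }
                }
            });
            new HSSFEventFactory().processWorkbookEvents(request, fs);
        }
    }
}
```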
There is no streaming approach for writing HSSF up to now. The reason is that the old binary Excel worksheets only provide 65,536 rows and 256 columns, so the amount of data in one sheet cannot be that big, and a GB-sized *.xls file should not occur at all. You should not use Excel as a data exchange format for database data; that is not what a spreadsheet application is made for.
But even if someone programmed a streaming approach for writing HSSF, it would not solve your problem, because there is still nothing for appending data to an existing *.xls file. The problems there are the same as with the Office Open XML file formats.

Related

Does saving files in a ".zip" folder speed up file write time to network drive?

I know that when I write a new file to a folder that ends in ".zip" it compresses the file. This is when using BufferedOutputStream in Java and saving to a Windows file system. I'm saving these files to a network drive, so the write time is dependent on network speed.
Will saving to a .zip folder speed up write time? In other words, does it transfer the data uncompressed and then compress it (so it wouldn't speed up write time), or does it compress first and then write out the file? Sorry if this is an ignorant question.
There are so many misconceptions in the question that I think it is worth going through them one at a time.
I know that when I write a new file to a folder that ends in ".zip" it compresses the file.
That is not correct. Creating a file with a ".zip" suffix does not automatically make it compressed. Writing files to a directory that has ".zip" as its filename suffix (?!?) doesn't either. Not in Java. Not in other languages.
In order to get compression, the application needs to take steps to make this happen. In Java you could use ZipOutputStream to write a file in ZIP file format. However, a ZIP file is actually an "archive" format that is designed to hold multiple files. If you are simply trying to compress a single file, there are better alternatives; e.g. GZIPOutputStream.
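For example, a minimal sketch of compressing a single file with GZIPOutputStream while writing it out; the file names are placeholders.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

public class GzipSingleFile {
    public static void main(String[] args) throws Exception {
        try (InputStream in = new BufferedInputStream(new FileInputStream("report.dat"));
             OutputStream out = new GZIPOutputStream(
                     new BufferedOutputStream(new FileOutputStream("report.dat.gz")))) {
            byte[] buffer = new byte[8192];
            int n;
            // The bytes are compressed by GZIPOutputStream before they reach the file or network share.
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
    }
}
```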
(It is also possible that this so-called "ZIP folder" you are talking about is a normal ZIP file that has been "mounted" as a loopback file system. You / someone else would have had to set that up explicitly. Anyhow, if this is what is going on here, it is nothing to do with Java. It is all happening in external software and in the operating system where the ZIP is "mounted".)
This is when using BufferedOutputStream in JAVA and saving to a windows file system.
Erm ... no. See above. However, you are correct that it may be better to use a BufferedOutputStream to write files, though it only really helps if your application is writing the files in small chunks; e.g. a byte at a time. (Stream compression complicates the issue, so it is difficult to give a simple, general answer on this.)
I'm saving these files to a network drive, so the write time is dependent on network speed.
Correct. It is also dependent on network latency, the protocols used and the load on the remote file server. (If you have a ZIP "mounted", then that is going to add overheads too.)
Will saving to a .zip folder speed up write time?
Maybe. See above. It depends on what you mean by a ZIP folder.
Ignoring that, writing the files (the right way) in compressed and / or archive form from Java may speed up writes. There are actually two things to consider:
For plain compression, you are trading off the time it takes the application (!!) to compress and decompress the data against the time (and disk space) you are saving by moving and storing fewer bytes.
For ZIP files (and similar archive formats) there is a second potential saving. Storing and retrieving lots of individual small files from a file system is slow compared with storing and retrieving a single ZIP file containing those files (see the sketch after this list).
And if you are looking for optimal compression, then ZIP is not the best option.
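A minimal sketch of that second point: packing many small files into one ZIP archive before it goes to the (slower) destination. The source directory and the network target path are placeholder assumptions.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.nio.file.Files;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipManySmallFiles {
    public static void main(String[] args) throws Exception {
        File sourceDir = new File("reports");                        // placeholder source directory
        File[] files = sourceDir.listFiles();
        if (files == null) {
            return; // not a directory, or not readable
        }
        try (ZipOutputStream zip = new ZipOutputStream(
                new FileOutputStream("Z:/share/reports.zip"))) {     // placeholder network target
            for (File file : files) {
                if (file.isFile()) {
                    // One ZIP entry per small file: one remote write instead of many.
                    zip.putNextEntry(new ZipEntry(file.getName()));
                    Files.copy(file.toPath(), zip);
                    zip.closeEntry();
                }
            }
        }
    }
}
```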
In other words, does it transfer the data uncompressed and then compresses it (so it wouldn't speed up write time) or does it compress then write out the file?
There are so many variables that it is hard to say for sure. But unless you have done something odd, it is likely that the bytes are sent over the network in compressed form.
Finally, I would advise you NOT to try to combine mounted ZIP files and network shares:
The combination of the two could potentially interact in ways that make performance worse.
There is a risk that you will end up with a corrupted ZIP or lost files if the network share goes offline at an inconvenient point.

PDF Compression - HTML to PDF (wkhtmltopdf)

Background
I'm working on a Scala/Java project where we convert individual HTML files to PDF files. We then merge the individual files into one larger complete PDF file.
For the converting we are using sPDF which is built on top of wkhtmltopdf. For the merging we use PDFMergerUtility.
The reasons for making individual files are a bit complicated, but it should be noted that we can't make one big PDF off the bat and have to make the individual files first.
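For context, the merge step described here typically looks roughly like the following; this is a hedged sketch assuming PDFBox 2.x and placeholder file names, not the project's actual code.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;

public class MergeSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical list of the individual wkhtmltopdf outputs.
        List<File> parts = Arrays.asList(new File("part1.pdf"), new File("part2.pdf"));

        PDFMergerUtility merger = new PDFMergerUtility();
        for (File part : parts) {
            merger.addSource(part);
        }
        merger.setDestinationFileName("complete.pdf");
        // Use temp files instead of heap while merging (PDFBox 2.x API).
        merger.mergeDocuments(MemoryUsageSetting.setupTempFileOnly());
    }
}
```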
The issue
Initially we had no problems with this approach; however, as the system has grown, so have the final PDF files. We went from files that were 2MB-3MB to files that are 20MB. I would like to know if there are any obvious compression methods or techniques we could use.
There is a lot of shared content across the individual files, but since we're just merging them as isolated/independent files (as in, none of the content that is the same across the individual files is being reused to save space), that overlap does nothing to bring the file size down.
If I manually ZIP the final PDF file it greatly reduces the file size, as obviously there is a lot of repeated content.
So one option might just be to zip the PDF after I've finished the merging, but I would prefer to compress it during the merger process or conversion process.
Any ideas?
You could try Sejda to merge; it's Java, open source and based on a fork of PDFBox. It can generate PDF files using object streams (PDFBox currently doesn't support that), and in case that doesn't reduce the size much, you can pipe its 'compress' task, which goes through the document removing unused resources and compressing images.
It's battle-tested as the engine behind PDFsam, so if you want to give it a quick test and see the outcome, just download PDFsam, use the merge module with your files (with the compression flag on), and the result is what Sejda will generate.

Large Excel File - Upload and Import - Java

Hi all,
I want to upload and import a large Excel file having more than one million records into my Java program.
I can easily import small files using Apache POI into my system, but when I start with large files the application throws an out of memory error.
I searched Google and found many threads on SO; I tried everything but could not get around this.
Can anybody give me a solution to my particular problem? Import time is not an issue for me, and right now I can bear with a performance hit as well;
I just want to import this data into my existing system without the OOM error.
I have a very good configuration on my system and Java has enough memory to use, so hardware is not an issue.
Thanks
You'll want to stream the data so that you don't need to store all the records in memory at once. POI does support streaming (see XSSF and SAX event API). As you read the data, ship it off to wherever you need to (database or wherever, you did not specify) -- with the streaming API you should not read all the data into memory before processing it.
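A hedged sketch of that streaming read, assuming a reasonably recent POI version and an .xlsx input; the file name is a placeholder, and the handler just prints each cell where your code would ship the value off to the database.

```java
import java.io.InputStream;
import java.util.Iterator;

import javax.xml.parsers.SAXParserFactory;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.SheetContentsHandler;
import org.apache.poi.xssf.usermodel.XSSFComment;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class StreamingXlsxImport {

    // Called row by row and cell by cell; replace the println with your database insert.
    static class RowHandler implements SheetContentsHandler {
        public void startRow(int rowNum) { }
        public void endRow(int rowNum) { }
        public void cell(String cellReference, String formattedValue, XSSFComment comment) {
            System.out.println(cellReference + " = " + formattedValue);
        }
        public void headerFooter(String text, boolean isHeader, String tagName) { }
    }

    public static void main(String[] args) throws Exception {
        try (OPCPackage pkg = OPCPackage.open("big.xlsx")) {          // placeholder file name
            XSSFReader reader = new XSSFReader(pkg);
            ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(pkg);
            XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            parser.setContentHandler(new XSSFSheetXMLHandler(
                    reader.getStylesTable(), strings, new RowHandler(), false));

            Iterator<InputStream> sheets = reader.getSheetsData();
            while (sheets.hasNext()) {
                try (InputStream sheet = sheets.next()) {
                    parser.parse(new InputSource(sheet)); // rows are handled as they are parsed
                }
            }
        }
    }
}
```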
You could also export the data to a CSV file, and then use a regular FileInputStream to read the file and process each record as it is read.
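For the CSV route, a BufferedReader is usually more convenient than a raw FileInputStream; a minimal sketch, assuming a simple comma-separated export with no quoted fields and a placeholder file name:

```java
import java.io.BufferedReader;
import java.io.FileReader;

public class CsvImportSketch {
    public static void main(String[] args) throws Exception {
        try (BufferedReader reader = new BufferedReader(new FileReader("export.csv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Only one record is in memory at a time; naive split, no quoted fields.
                String[] fields = line.split(",");
                process(fields);
            }
        }
    }

    static void process(String[] fields) {
        // Placeholder: insert into the database, or whatever the application needs.
        System.out.println(String.join(" | ", fields));
    }
}
```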

TrueZip Random Access Functionality

I'm trying to understand how to randomly traverse a file/files in a .tar.gz using TrueZIP in a Java 6 environment (using the Files classes). I found instances where it uses Java 7's Path; however, I can't come up with an example of how to randomly read an archive on Java 6.
Additionally, does "random" reading mean that it first uncompresses the entire archive, or does it read sections in the compressed file? The purpose is that I want to retrieve some basic information from the file without having to uncompress the entire thing just to read it (i.e. username).
The method that gzip uses to compress a file (especially .tar.gz files) usually implies that the output file is not random-accessible - you need the symbol table and other context from the entire file up to the current block to even be able to uncompress that block to see what's in it. This is one of the ways it achieves (somewhat) better compression over ZIP/pkzip, which compress each file individually before adding them to a container archive, resulting in the ability to seek to a specific file and uncompress just that file.
So, in order to pick a .tar.gz apart, you will need to uncompress the whole thing, either to a temporary file or in memory (if it's not too large). Then you can jump to specific entries in the underlying .tar file, although that has to be done sequentially by skipping from header to header, since tar does not include a central index/directory of files.
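Not TrueZIP specifically, but a sketch of that sequential header-to-header walk using Apache Commons Compress; the gzip stream still has to be decompressed up to the entry you want, the entry bodies just never need to be held in memory. The archive and entry names are placeholders.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;

public class TarGzScanSketch {
    public static void main(String[] args) throws Exception {
        try (TarArchiveInputStream tar = new TarArchiveInputStream(
                new GzipCompressorInputStream(
                        new BufferedInputStream(new FileInputStream("backup.tar.gz"))))) {
            TarArchiveEntry entry;
            // Walk header to header; getNextTarEntry() skips over the previous entry's data.
            while ((entry = tar.getNextTarEntry()) != null) {
                System.out.println(entry.getName() + " (" + entry.getSize() + " bytes)");
                if (entry.getName().endsWith("userinfo.txt")) {      // hypothetical entry of interest
                    ByteArrayOutputStream content = new ByteArrayOutputStream();
                    byte[] buffer = new byte[4096];
                    int n;
                    while ((n = tar.read(buffer)) != -1) {           // read() stops at the end of this entry
                        content.write(buffer, 0, n);
                    }
                    System.out.println(content.toString("UTF-8"));
                    break; // entries after this one are never even inspected
                }
            }
        }
    }
}
```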
I am not aware of TrueZip in particular, but at least in terms of Zip, RAR and Tar you can access single files and retrieve details about them and even extract them without touching the rest of the package.
Additionally, does "random" reading mean that it first uncompresses
the entire archive
If TrueZip follows Zip/RAR/Tar format, then it does not uncompress the entire archive.
The purpose is that I want to retrieve some basic information from the file without having to uncompress the entire thing just to read it (i.e. username).
As said previously, that should be fine -- I don't know the TrueZip API in particular, but file container formats allow you to inspect file info without reading a single bit of the data, and optionally extract/read the file contents without touching any other file in the container.
The source code comments of zran describe how such tools work:
http://svn.ghostscript.com/ghostscript/tags/zlib-1.2.3/examples/zran.c
In conclusion, one can say that the complete file has to be processed once to generate the necessary index.
That is much faster than actually decompressing everything.
The index allows the file to be split into blocks that can be decompressed without having to decompress the blocks before them. That is used to emulate random access.

Write Excel data directly to OutputStream (limit memory consumption)

I'm looking for a simple solution to output big Excel files. Input data comes from the database, and I'd like to output the Excel file directly to disk to keep memory consumption as low as possible. I had a look at things like Apache POI or jxls but found no way to solve my problem.
As additional information, I need to generate .xls files for pre-2007 Excel, not the new .xlsx XML format. I also know I could generate CSV files, but I'd prefer to generate plain Excel…
Any ideas?
I realize my question isn't so clear; I really want to be able to write the Excel file without having to keep the whole thing in memory...
The only way to do this efficiently is to use a character-based CSV or XML (XLSX) format, because those can be written to the output line by line, so that effectively you only ever hold one line in memory at a time. The binary XLS format must first be populated completely in memory before it can be written to the output, and that of course hogs memory when there is a large amount of records.
I would recommend using CSV for this, as it may be more efficient than XML. Plus, you have the advantage that any decent database server has export capabilities for that, so you don't need to program/include anything new in Java. I don't know which DB you're using, but if it were, for example, MySQL, then you could use SELECT ... INTO OUTFILE (the export counterpart of LOAD DATA INFILE) for this.
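If you do generate the CSV from Java instead, here is a minimal sketch of streaming a JDBC result set straight to a CSV file so that only one row is in memory at a time; the connection URL, query and (absent) quoting/escaping are simplified placeholders.

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class DbToCsvSketch {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/db", "user", "pass");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT * FROM report_data");     // placeholder query
             BufferedWriter out = new BufferedWriter(new FileWriter("report.csv"))) {
            ResultSetMetaData meta = rs.getMetaData();
            int columns = meta.getColumnCount();
            while (rs.next()) {
                StringBuilder line = new StringBuilder();
                for (int c = 1; c <= columns; c++) {
                    if (c > 1) line.append(',');
                    line.append(rs.getString(c));   // no quoting/escaping in this sketch
                }
                out.write(line.toString());
                out.newLine();                      // one row at a time, nothing accumulates
            }
        }
    }
}
```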
No idea for generating a real XLS file. But you can directly write an HTML file, or a ZIP stream containing an OpenDocument spreadsheet (I guess MS Excel can read this latter format).
JExcelAPI is often recommended as a more memory-efficient alternative to Apache POI.
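For what it's worth, a minimal JExcelAPI (jxl) writing sketch; as far as I know it still builds the workbook in memory before writing, it is just lighter than POI's HSSF usermodel, so treat the memory claim as something to verify. The file name and content are placeholders.

```java
import java.io.File;

import jxl.Workbook;
import jxl.write.Label;
import jxl.write.WritableSheet;
import jxl.write.WritableWorkbook;

public class JxlWriteSketch {
    public static void main(String[] args) throws Exception {
        WritableWorkbook workbook = Workbook.createWorkbook(new File("report.xls"));
        WritableSheet sheet = workbook.createSheet("data", 0);
        for (int r = 0; r < 1000; r++) {
            sheet.addCell(new Label(0, r, "row " + r));   // column, row, content
        }
        workbook.write();
        workbook.close();
    }
}
```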
