Merging PDFs with PDFBox creates an unnecessarily large file - java

Massive number of hits on this topic, but only low-quality threads :(
I merge a bunch of PDF files with PDFBox. Easy with a class built for the purpose.
But the result is a very large file. I have no exact figures right now, but it's easily twice the size of a merge done by an ordinary desktop app.
Not acceptable, I'm afraid.
The problem seems similar to this issue (a split in that case, but same same but different):
https://issues.apache.org/jira/browse/PDFBOX-785
After some googling, I think the problem is that the merge produces a bare-bones merged PDF file, and a large one at that, without compression.
According to this blog, some Java PDF libraries can handle compression:
http://pdf-house.blogspot.com/
iText handles this with PdfStamper's setFullCompression():
http://www.java2s.com/Tutorial/Java/0419_PDF/CompressPdfdocument.htm
But I also bumped into the Ghostscript project:
https://www.linux.com/news/software/applications/8229-putting-together-pdf-files
So, I need a second opinion. Ghostscript seems cool, but iText does the trick according to Google.
Am I on the right track? What should I choose? One of the above, or something entirely different?
Thanks!

Try mixing PDFBox for merging with iText for compression.
See this Groovy example: http://pastebin.com/w8Rz8uha
I tested it with http://www.tobcon.ie/assets/files/test.pdf (100 duplicated pages): uncompressed.pdf is 302 kB and compressed.pdf is 58 kB.
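A minimal Java sketch of the same mix, assuming PDFBox 2.x and iText 5 (the input file names a.pdf and b.pdf are placeholders):

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;

import org.apache.pdfbox.multipdf.PDFMergerUtility;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfStamper;
import com.itextpdf.text.pdf.PdfWriter;

public class MergeThenCompress {
    public static void main(String[] args) throws Exception {
        // 1) Merge with PDFBox; the result is typically uncompressed.
        PDFMergerUtility merger = new PDFMergerUtility();
        merger.addSource("a.pdf");
        merger.addSource("b.pdf");
        merger.setDestinationFileName("uncompressed.pdf");
        merger.mergeDocuments(null);  // PDFBox 2.x; pass a MemoryUsageSetting if needed

        // 2) Re-save with iText's PdfStamper and full (object-stream) compression.
        PdfReader reader = new PdfReader(new FileInputStream("uncompressed.pdf"));
        PdfStamper stamper = new PdfStamper(reader,
                new FileOutputStream("compressed.pdf"), PdfWriter.VERSION_1_5);
        stamper.setFullCompression();  // requires PDF 1.5+, hence the version bump
        stamper.close();
        reader.close();
    }
}
```

Full compression writes cross-reference and object streams, which is usually where the big size win over a plain merge comes from.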

Related

PDF Compression - HTML to PDF (wkhtmltopdf)

Background
I'm working on a Scala/Java project where we convert individual HTML files to PDF files. We then merge the individual files into one larger complete PDF file.
For the converting we are using sPDF which is built on top of wkhtmltopdf. For the merging we use PDFMergerUtility.
The reasons for making individual files are a bit complicated, but it should be noted that we can't make one big PDF off the bat and have to make the individual files first.
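For reference, the merge step with PDFMergerUtility typically looks like the sketch below (a minimal example assuming PDFBox 2.x; the file names are placeholders). MemoryUsageSetting lets you trade heap for temp files, but it does not add compression to the output:

```java
import java.io.File;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;

public class MergeParts {
    public static void main(String[] args) throws Exception {
        PDFMergerUtility merger = new PDFMergerUtility();
        // Add each individually converted PDF in order.
        merger.addSource(new File("part1.pdf"));
        merger.addSource(new File("part2.pdf"));
        merger.setDestinationFileName("complete.pdf");
        // Buffer on disk rather than in memory while merging.
        merger.mergeDocuments(MemoryUsageSetting.setupTempFileOnly());
    }
}
```

Because PDFMergerUtility copies each source's resources into the result as-is, duplicated fonts and images across the parts are not deduplicated, which matches the growth described above.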
The issue
Initially we had no problems with this approach - however as the system has grown - so have the final PDF files. We went from files that were 2MB-3MB to files that are 20MB. I would like to know if there is any obvious compression methods or techniques we could use?
There is a lot of shared content across the individual files, but since we're merging them as isolated/independent files, none of the identical content is reused to save space, so the sharing does nothing to bring the file size down.
If I manually zip the final PDF file, it greatly reduces the size, as there is obviously a lot of repeated content.
So one option might be to zip the PDF after I've finished merging, but I would prefer to compress it during the merge or conversion process.
Any ideas?
You could try Sejda to merge; it's Java, open source, and based on a fork of PDFBox. It can generate PDF files using object streams (PDFBox currently doesn't support that) and, in case that doesn't reduce the size enough, you can pipe in its 'compress' task, which goes through the document removing unused resources and compressing images.
It's battle-tested as the engine behind PDFsam, so if you want to give it a quick test and see the outcome, just download PDFsam, use the merge module with your files (with the compression flag on), and the result is what Sejda will generate.

Pagination of an xlsx file to XSSFWorkbook using Apache POI

Right now my code reads an xlsx file into an XSSFWorkbook and then finally writes it into a database. But when the size of the xlsx file increases, it causes an OutOfMemoryError.
I cannot increase the server's memory, and I cannot divide the xlsx file into pieces.
I tried loading the workbook from a File (instead of an InputStream), but that didn't help either.
I am looking for a way to read 10k rows at a time (instead of the entire file at once) and iteratively write them to the database.
Is there a good way to do this with Apache POI?
POI contains something called an "eventmodel" which is designed exactly for this purpose. It's mentioned in the FAQ:
The SS eventmodel package is an API for reading Excel files without loading the whole spreadsheet into memory. It does require more knowledge on the part of the user, but reduces memory consumption by more than tenfold. It is based on the AWT event model in combination with SAX. If you need read-only access, this is the best way to do it.
However, you may want to double check first if the issue is somewhere else. Check out this item:
I think POI is using too much memory! What can I do?
This one comes up quite a lot, but often the reason isn't what you might initially think. So, the first thing to check is - what's the source of the problem? Your file? Your code? Your environment? Or Apache POI?
(If you're here, you probably think it's Apache POI. However, it often isn't! A moderate laptop, with a decent but not excessive heap size, from a standing start, can normally read or write a file with 100 columns and 100,000 rows in under a couple of seconds, including the time to start the JVM).
Apache POI ships with a few example programs that can be used for basic performance checks. For testing file generation, the class to use is SSPerformanceTest in the examples package. Run SSPerformanceTest with arguments for the writing type (HSSF, XSSF or SXSSF), the number of rows, the number of columns, and whether the file should be saved. If you can't run that with 50,000 rows and 50 columns in HSSF and SXSSF in under 3 seconds, and XSSF in under 10 seconds (and ideally all 3 in less than that!), then the problem is with your environment.
Next, use the example program ToCSV to try reading a file in with HSSF or XSSF. Related is XLSX2CSV, which uses SAX parsing for .xlsx. Run these against both your problem file and a simple file of the same size generated by SSPerformanceTest. If this is slow, there could be an Apache POI problem with how the file is being processed (POI makes some assumptions that might not always be right for all files). If these tests are fast, then any performance problems are in your code!
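A sketch of the SAX/eventmodel approach for .xlsx, modelled on XLSX2CSV, that batches rows for the database write (assumes a recent POI release; older versions expose SAXHelper instead of XMLHelper, and the actual database insert is left as a stub):

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.util.XMLHelper;
import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.SheetContentsHandler;
import org.apache.poi.xssf.model.StylesTable;
import org.apache.poi.xssf.usermodel.XSSFComment;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class StreamingXlsxReader {
    static final int BATCH_SIZE = 10_000;

    public static void main(String[] args) throws Exception {
        try (OPCPackage pkg = OPCPackage.open("big.xlsx")) {
            XSSFReader reader = new XSSFReader(pkg);
            ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(pkg);
            StylesTable styles = reader.getStylesTable();

            SheetContentsHandler batcher = new SheetContentsHandler() {
                final List<String> batch = new ArrayList<>();
                final StringBuilder row = new StringBuilder();

                @Override public void startRow(int rowNum) { row.setLength(0); }

                @Override public void cell(String ref, String value, XSSFComment c) {
                    row.append(value).append('\t');
                }

                @Override public void endRow(int rowNum) {
                    batch.add(row.toString());
                    if (batch.size() == BATCH_SIZE) {
                        // insertIntoDatabase(batch);  // hypothetical JDBC batch insert
                        batch.clear();
                    }
                }
            };

            XMLReader parser = XMLHelper.newXMLReader();
            parser.setContentHandler(
                    new XSSFSheetXMLHandler(styles, strings, batcher, false));
            try (InputStream sheet = reader.getSheetsData().next()) {
                parser.parse(new InputSource(sheet));  // first sheet only
            }
        }
    }
}
```

Only one row is held in memory at a time (plus the current batch), so heap usage stays flat regardless of file size; remember to flush any final partial batch after parsing completes.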

Embed multiple JPEG images into Excel programmatically?

Dear Stackoverflowers,
Outputting an Excel spreadsheet with images is a requirement of a project I'm doing. I've done a little research and found the following (perhaps incorrect) consensus:
various Python libraries for creating Excel sheets work well
it is possible to insert images (but only BMP)
the "internal format" of images used in Excel files is complicated, which may be why there is no third-party library support for inserting normal formats like JPEG.
I don't want to use or convert to BMP. Why? BMPs compress poorly, and these will be big sheets, so I want to mitigate the size impact of the images (one per row) as much as possible.
My ideal answer comes from someone who has actually done this. The suggested method can be in Java, Ruby, Python (but not .NET), or some other creative way of doing it.
I'm really hoping someone out there has a solution, as I anticipate this could be a tricky area (similar in complexity to playing around with PDFs, perhaps).
The Perl module Excel::Writer::XLSX can insert JPEG, PNG, and BMP images into a new Excel workbook.
I am currently porting it to a Python module called XlsxWriter, and the insert_image() function is near the top of the TODO list.
Update: As of version 0.1.6 of XlsxWriter it is now possible to add PNG/JPEG images. See the example in the documentation.
As said in the comment above, Apache POI can solve your problem.
I did a little research, and this example should be useful: Apache POI Excel Insert an Image.
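For what it's worth, current Apache POI can insert JPEG (and PNG) directly; a minimal sketch along the lines of that example, with the file names as placeholders:

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;

import org.apache.poi.ss.usermodel.ClientAnchor;
import org.apache.poi.ss.usermodel.CreationHelper;
import org.apache.poi.ss.usermodel.Drawing;
import org.apache.poi.ss.usermodel.Picture;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.util.IOUtils;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class InsertImages {
    public static void main(String[] args) throws Exception {
        try (Workbook wb = new XSSFWorkbook()) {
            Sheet sheet = wb.createSheet("images");
            CreationHelper helper = wb.getCreationHelper();
            Drawing<?> drawing = sheet.createDrawingPatriarch();

            try (InputStream in = new FileInputStream("photo.jpg")) {
                byte[] bytes = IOUtils.toByteArray(in);
                // The JPEG bytes are stored as-is, so there is no BMP size penalty.
                int pictureIdx = wb.addPicture(bytes, Workbook.PICTURE_TYPE_JPEG);

                ClientAnchor anchor = helper.createClientAnchor();
                anchor.setCol1(0);
                anchor.setRow1(0);  // one image per row: increment row1 per image
                Picture pict = drawing.createPicture(anchor, pictureIdx);
                pict.resize();  // render at the image's native size
            }

            try (FileOutputStream out = new FileOutputStream("images.xlsx")) {
                wb.write(out);
            }
        }
    }
}
```

Since .xlsx stores the picture bytes unchanged inside the zip container, embedding JPEGs keeps the size impact about as small as it can be.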

PDF, Open Office, or MS Word

I am new to Java. I have to read a PDF, Open Office, or MS Word file, make changes in the file, and render it as a PDF document on my web page. Could someone tell me which of these file formats has an API or SDK that is easy to use, and which SDK is best for this, so I can read, update, and render easily? The file also contains a table, but there are no images.
We use Apache POI to read Microsoft Office files. There are many libraries for PDF in Java. iText is something I have used. Once you pick the tools, do a selective search on Stack Overflow. There are plenty of discussions around these tools.
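As an illustration, reading a .docx with POI's XWPF layer might look like this minimal sketch (input.docx is a placeholder; assumes the poi-ooxml module is on the classpath):

```java
import java.io.FileInputStream;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFTable;

public class ReadDocx {
    public static void main(String[] args) throws Exception {
        try (FileInputStream in = new FileInputStream("input.docx");
             XWPFDocument doc = new XWPFDocument(in)) {
            // Body text, paragraph by paragraph.
            for (XWPFParagraph p : doc.getParagraphs()) {
                System.out.println(p.getText());
            }
            // Tables are exposed separately, which matters for this question.
            for (XWPFTable t : doc.getTables()) {
                System.out.println(t.getText());
            }
        }
    }
}
```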
Depending on the types of updates you are doing, modifying PDF is going to be a problem - it's not intended for editing. You might have to find some way of converting the PDF to something first, then edit. Depending on the types of changes you want to make and the documents you are working from even editing DOC and Writer files is going to be tricky. They are all different formats.
As Jayan mentioned, iText and POI may help you a little. OpenOffice Writer documents can be edited by unzipping them and modifying the XML, or by using the UNO API. Word documents can be edited by using MS Office automation (a bad idea), by converting to OpenOffice first and then editing, or, if DOCX, by unzipping and processing the XML.
Good luck.

How can I build a document in iText and then write it to disk separately

I'm trying to write a program to benchmark the iText PDF library and I'd like to separate out disk access if at all possible. My plan is to write everything into a Document in memory, then take that document and generate a PDF from it. Trouble is, the only way I can see to write the thing to disk is to use PdfWriter, which needs to be started before I begin adding things to the document. Is there any other option? I've been looking, but it's tough to find a good reference to iText online.
There are a huge number of online resources for using iText - not sure where you've tried looking. If google doesn't quite work for you, you can always buy the iText In Action book which is a fantastic resource.
Anyway, wrap your PdfWriter in an output stream that doesn't write to file (like a ByteArrayOutputStream).
PdfWriter.getInstance(document, new ByteArrayOutputStream());
although for your purposes, you may want to hold onto the reference to the BAOS.
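Putting that together, a minimal sketch assuming iText 5 (out.pdf is a placeholder path), where holding onto the ByteArrayOutputStream lets you time the disk write separately from document construction:

```java
import java.io.ByteArrayOutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import com.itextpdf.text.Document;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.PdfWriter;

public class InMemoryPdf {
    public static void main(String[] args) throws Exception {
        // Build the PDF entirely in memory.
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        Document document = new Document();
        PdfWriter.getInstance(document, baos);
        document.open();
        document.add(new Paragraph("Hello, iText"));
        document.close();  // baos now holds a complete PDF

        // Disk access happens only here and can be benchmarked in isolation.
        long start = System.nanoTime();
        Files.write(Paths.get("out.pdf"), baos.toByteArray());
        long elapsed = System.nanoTime() - start;
        System.out.println("write took " + elapsed + " ns");
    }
}
```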
