PDF Compression Techniques

I am trying to compress a PDF document in Java. The original file size is 1.5-2 MB and we need to bring it down to less than 1 MB. I tried iText compression on it, but the results are not very effective and the file size is still greater than 1 MB.
byte[] mergedFileContent = byteArrayOS.toByteArray();
reader = new PdfReader(mergedFileContent);
PdfStamper stamper = new PdfStamper(reader, byteArrOScomp);
stamper.setFullCompression();
stamper.close();
reader.close();
Has anyone worked on something similar? Any inputs would be appreciated.

You might want to look into the official iText examples, in particular the sample HelloWorldCompression is about applying different degrees of compression both at initial PDF creation time and as a post-processing step.
The following method from that sample may help you along.
public void compressPdf(String src, String dest) throws IOException, DocumentException {
PdfReader reader = new PdfReader(src);
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest), PdfWriter.VERSION_1_5);
stamper.getWriter().setCompressionLevel(9);
int total = reader.getNumberOfPages() + 1;
for (int i = 1; i < total; i++) {
reader.setPageContent(i, reader.getPageContent(i));
}
stamper.setFullCompression();
stamper.close();
reader.close();
}
If you wonder how I found it: I googled for "itextpdf example full compression" and it was the second result. (The first find contains the same method but is not from the official iText site.)

You could gzip, zip, etc. the file afterwards. It isn't really a PDF compression format, but if you are constrained and want better compression then compressing the entire thing may have good results since it can compress meta-level data.
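A minimal sketch of that idea, using only the JDK's java.util.zip classes. The class and method names (GzipSketch, gzip, gunzip) are made up for illustration. Keep in mind that the result is a .gz file, not a PDF a viewer can open directly, and that streams iText has already deflated will probably not shrink much further.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipSketch {

    // Compresses an arbitrary byte array (for example, the bytes of a finished PDF) with gzip.
    public static byte[] gzip(byte[] input) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(input);
        } // closing the stream writes the gzip trailer
        return bos.toByteArray();
    }

    // Restores the original bytes so the PDF can be opened again.
    public static byte[] gunzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = gz.read(buffer)) != -1) {
                bos.write(buffer, 0, n);
            }
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data; in practice this would be byteArrOScomp.toByteArray() from the question.
        byte[] pdfBytes = new byte[100_000];
        byte[] packed = gzip(pdfBytes);
        System.out.println(pdfBytes.length + " bytes -> " + packed.length + " bytes");
    }
}
```

Whether this gets a real document under 1 MB depends entirely on its content, so it is worth measuring on a representative file first.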

Related

Using POI To read/write a doc with the full POIFSFileSystem

I have the following issue (as does everybody, it seems): I want to replace some items with others in a Word doc.
The catch is that the doc contains headers and footers which are part of the POIFSFileSystem (I know this because reading the FS and writing the doc back, without any changes, loses this information, whereas reading the FS and writing it back as a new file doesn't).
Currently I do this :
POIFSFileSystem pfs = new POIFSFileSystem(fis);
HWPFDocument document = new HWPFDocument(pfs);
Range r1 = document.getRange();
…
document.write();
ByteArrayOutputStream bos = new ByteArrayOutputStream(50000);
pfs.writeFilesystem(bos);
pfs.close();
However this fails, with this error:
Opened read-only or via an InputStream, a Writeable File is required
If I don't rewrite the document, it works fine, but my changes are lost.
The other way around if I only save the document, not the filesystem, I lose the header/footer.
Now the problem is, how can I update the document while "saving as" the entire filesystem, or is there a way to force the document to contain everything from the file system?

The HWPF stuff is always in scratchpad because the DOC binary file format is the most horrible of all the Horrible formats. So it will really not be ready and also will be buggy in many cases.
But in your special case, your observations are not reproducible. Using apache poi 4.0.1 the HWPFDocument contains the header story, which also contains the footer stories, after creating from *.doc file. So the following works for me:
Code:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.hwpf.*;
import org.apache.poi.hwpf.usermodel.*;
public class ReadAndWriteDOCWithHeaderFooter {
public static void main(String[] args) throws Exception {
HWPFDocument document = new HWPFDocument(new FileInputStream("TemplateDOCWithHeaderFooter.doc"));
Range bodyRange = document.getRange();
System.out.println(bodyRange);
for (int p = 0; p < bodyRange.numParagraphs(); p++) {
System.out.println(bodyRange.getParagraph(p).text());
if (bodyRange.getParagraph(p).text().contains("<<NAME>>"))
bodyRange.getParagraph(p).replaceText("<<NAME>>", "Axel Richter");
if (bodyRange.getParagraph(p).text().contains("<<DATE>>"))
bodyRange.getParagraph(p).replaceText("<<DATE>>", "12/21/1964");
if (bodyRange.getParagraph(p).text().contains("<<AMOUNT>>"))
bodyRange.getParagraph(p).replaceText("<<AMOUNT>>", "1,234.56");
System.out.println(bodyRange.getParagraph(p).text());
}
System.out.println("==============================================================================");
Range overallRange = document.getOverallRange();
System.out.println(overallRange);
for (int p = 0; p < overallRange.numParagraphs(); p++) {
System.out.println(overallRange.getParagraph(p).text()); // contains all inclusive header and footer
}
FileOutputStream out = new FileOutputStream("ResultDOCWithHeaderFooter.doc");
document.write(out);
out.close();
document.close();
}
}
So please check again and tell us exactly what is not working for you. Because we need to reproduce it, please provide a minimal, complete, and verifiable example, as I have done with my code.

Get size (in bytes) of a specific page in a PDF using iText

I'm using iText (v 2.1.7) and I need to find the size, in bytes, of a specific page.
I've written the following code:
public static long[] getPageSizes(byte[] input) throws IOException {
PdfReader reader;
reader = new PdfReader(input);
int pageCount = reader.getNumberOfPages();
long[] pageSizes = new long[pageCount];
for (int i = 0; i < pageCount; i++) {
pageSizes[i] = reader.getPageContent(i+1).length;
}
reader.close();
return pageSizes;
}
But it doesn't work properly. The reader.getPageContent(i+1).length; instruction returns very small values (<= 100 usually), even for large pages that are more than 1MB, so clearly this is not the correct way to do this.
But what IS the correct way? Is there one?
Note: I've already checked this question, but the offered solution consists of writing each page of the PDF to disk and then checking the file size, which is extremely inefficient and may even be wrong, since I'm assuming this would repeat the PDF header and metadata each time. I was searching for a more "proper" solution.

Well, in the end I managed to get hold of the source code for the original program that I was working with, which only accepted PDFs as input with a maximum "page size" of 1MB. Turns out... what it actually means by "page size" was fileSize / pageCount -_-^
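For reference, that "page size" check boils down to a single division. A small sketch follows; the class and method names are invented here, and the 1 MB limit is assumed to mean 1024 * 1024 bytes, which the original program may or may not use.

```java
public class AveragePageSize {

    // Assumption: "1MB" means 1024 * 1024 bytes; the real program's constant is unknown.
    public static final long ONE_MB = 1024L * 1024L;

    // "Page size" as the original program apparently defined it: file size divided by page count.
    public static long averagePageSize(long fileSizeBytes, int pageCount) {
        if (pageCount <= 0) {
            throw new IllegalArgumentException("pageCount must be positive");
        }
        return fileSizeBytes / pageCount;
    }

    // True when the average stays within the given limit.
    public static boolean withinLimit(long fileSizeBytes, int pageCount, long limitBytes) {
        return averagePageSize(fileSizeBytes, pageCount) <= limitBytes;
    }

    public static void main(String[] args) {
        long fileSize = 3_000_000L; // hypothetical input length in bytes
        int pages = 4;              // e.g. reader.getNumberOfPages()
        System.out.println(averagePageSize(fileSize, pages));
        System.out.println(withinLimit(fileSize, pages, ONE_MB));
    }
}
```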
For anyone who actually needs the precise size of a "standalone" page, with all content included, I've tested this solution and it seems to work well, though it probably isn't very efficient, as it writes out an entire PDF document for each page. Using a memory stream instead of a disk-based one helps, but I don't know how much.
public static int[] getPageSizes(byte[] input) throws IOException {
PdfReader reader;
reader = new PdfReader(input);
int pageCount = reader.getNumberOfPages();
int[] pageSizes = new int[pageCount];
for (int i = 0; i < pageCount; i++) {
try {
Document doc = new Document();
ByteArrayOutputStream bous = new ByteArrayOutputStream();
PdfCopy copy= new PdfCopy(doc, bous);
doc.open();
PdfImportedPage page = copy.getImportedPage(reader, i+1);
copy.addPage(page);
doc.close();
pageSizes[i] = bous.size();
} catch (DocumentException e) {
e.printStackTrace();
}
}
reader.close();
return pageSizes;
}

ItextSharp - diacritic chars

I am reading PDF documents via the iTextSharp library.
But these documents are in Czech, which uses diacritics (ř ě ž š č etc.).
How can I read these characters? Any ideas? Or is there a way to replace them with plain r e z s c?
This is the code in my method. Thanks.
PdfReader reader = new PdfReader("M:/ShareDirs_KSP/RDM_Debtors/DMS_PROD/" + src);
// we can inspect the syntax of the imported page
String text = new String();
for (int page = 1; page <= 1; page++) {
text += PdfTextExtractor.getTextFromPage(reader, page);
}
reader.close();

I have written a small proof of concept that parses the file czech.pdf. This file contains several characters with diacritics. It was created in answer to the following question: Can't get Czech characters while generating a PDF
The text is stored in the file twice: once using a simple font, once using a composite font. In my proof of concept (named ParseCzech), I parse this PDF to a file encoded using UTF-8 (UNICODE):
public void parse(String filename) throws IOException {
PdfReader reader = new PdfReader(filename);
FileOutputStream fos = new FileOutputStream(DEST);
for (int page = 1; page <= 1; page++) {
fos.write(PdfTextExtractor.getTextFromPage(reader, page).getBytes("UTF-8"));
}
fos.flush();
fos.close();
}
The result is the file czech.txt:
As you can see from the screen shot, the text is extracted correctly (but make sure that the viewer you use knows that the file is encoded as UTF-8, otherwise you may see strange characters instead of the actual text).
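To illustrate the encoding caveat with plain JDK code (no iText involved): the same bytes read back correctly as UTF-8 and turn into garbage when a viewer assumes a single-byte charset. The class name is made up, and ISO-8859-1 is only one example of a wrong guess.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {

    // Decodes bytes with the given charset, as a text viewer would.
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "řěžšč" written with Unicode escapes so the source file's own encoding cannot interfere.
        String czech = "\u0159\u011b\u017e\u0161\u010d";
        byte[] utf8 = czech.getBytes(StandardCharsets.UTF_8);

        // Each of these characters takes two bytes in UTF-8.
        System.out.println(utf8.length);

        // Correct charset: the original text comes back.
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(czech));

        // Wrong charset: every byte becomes its own character, so the text is mangled.
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1).equals(czech));
    }
}
```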
Note that some PDFs do not allow text to be extracted correctly. This is explained in the following video: http://www.youtube.com/watch?v=wxGEEv7ibHE
Please share your PDF so that people on StackOverflow can check whether text extraction fails because of an error in your code or because the PDF itself doesn't allow the text to be extracted.
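On the side question of replacing the characters with plain r e z s c: once the text has been extracted correctly, one common approach in Java is Unicode decomposition followed by removal of the combining marks. This is a sketch with the JDK's java.text.Normalizer, not an iText feature; the class and method names are invented, and letters that have no decomposition (none of the Czech ones, as far as I know) would pass through unchanged.

```java
import java.text.Normalizer;

public class StripDiacritics {

    // Splits each accented letter into base letter + combining mark (NFD),
    // then deletes the combining marks (Unicode category M).
    public static String strip(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "ř ě ž š č" written with Unicode escapes to stay independent of the source encoding.
        String czech = "\u0159 \u011b \u017e \u0161 \u010d";
        System.out.println(strip(czech)); // prints "r e z s c"
    }
}
```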

"Random" generated documents fail to print

We are attempting to generate documents using iText that are formed largely from "template" files - smaller PDF files that are combined together into one composite file using the PdfContentByte.addTemplate method. We then automatically and silently print the new file using the *nix command lp. This usually works; however occasionally, some files that are generated will fail to print. The document proceeds through all queues and arrives at the printer proper (a Lexmark T652n, in this case), its physical display gives a message of pending progress, and even its mechanical components whir up in preparation - then, the printing job vanishes spontaneously without a trace, and the printer returns to being ready.
The oddity is in how specific this issue tends to be. For starters, the files in question print without fail when done manually through Adobe PDF Viewer, and can be read fine by editors like Adobe Live Cycle. Furthermore, the content of the file affects whether it is plagued by this issue, but not in a clear way - adding a specific template 20 times could cause the problem, while doing it 19 or 21 times might be fine, or using a different template will change the pattern entirely and might cause it to happen instead after 37 times. Generating a document with the exact same content will be consistent on whether or not the issue occurs, but any subtle and seemingly irrelevant change in content will change whether the problem happens.
While it could be considered a hardware issue, the fact remains that certain iText-generated files have this issue while others do not. Is our method of file creation sometimes creating files that are somehow considered corrupt only to the printer and only sometimes?
Here is a relatively small code example that generates documents using the repetitive template method similar to our main program. It uses this file as a template and repeats it a specified number of times.
public class PDFFileMaker {
private static final int INCH = 72;
final private static float MARGIN_TOP = INCH / 4;
final private static float MARGIN_BOTTOM = INCH / 2;
private static final String DIREC = "/pdftest/";
private static final String OUTPUT_FILEPATH = DIREC + "cooldoc_%d.pdf";
private static final String TEMPLATE1_FILEPATH = DIREC + "template1.pdf";
private static final Rectangle PAGE_SIZE = PageSize.LETTER;
private static final Rectangle TEMPLATE_SIZE = PageSize.LETTER;
private ByteArrayOutputStream workingBuffer;
private ByteArrayOutputStream storageBuffer;
private ByteArrayOutputStream templateBuffer;
private float currPosition;
private int currPage;
private int formFillCount;
private int templateTotal;
private static final int DEFAULT_NUMBER_OF_TIMES = 23;
public static void main (String [] args) {
System.out.println("Starting...");
PDFFileMaker maker = new PDFFileMaker();
File file = null;
try {
file = maker.createPDF(DEFAULT_NUMBER_OF_TIMES);
}
catch (Exception e) {
e.printStackTrace();
}
if (file == null || !file.exists()) {
System.out.println("File failed to be created.");
}
else {
System.out.println("File creation successful.");
}
}
public File createPDF(int inCount) throws Exception {
templateTotal = inCount;
String sFilepath = String.format(OUTPUT_FILEPATH, templateTotal);
workingBuffer = new ByteArrayOutputStream();
storageBuffer = new ByteArrayOutputStream();
templateBuffer = new ByteArrayOutputStream();
startPDF();
doMainSegment();
finishPDF(sFilepath);
return new File(sFilepath);
}
private void startPDF() throws DocumentException, FileNotFoundException {
Document d = new Document(PAGE_SIZE);
PdfWriter w = PdfWriter.getInstance(d, workingBuffer);
d.open();
d.add(new Paragraph(" "));
d.close();
w.close();
currPosition = 0;
currPage = 1;
formFillCount = 1;
}
protected void finishPDF(String sFilepath) throws DocumentException, IOException {
//Transfers data from buffer 1 to builder file
PdfReader r = new PdfReader(workingBuffer.toByteArray());
PdfStamper s = new PdfStamper(r, new FileOutputStream(sFilepath));
s.setFullCompression();
r.close();
s.close();
}
private void doMainSegment() throws FileNotFoundException, IOException, DocumentException {
File fTemplate1 = new File(TEMPLATE1_FILEPATH);
for (int i = 0; i < templateTotal; i++) {
doTemplate(fTemplate1);
}
}
private void doTemplate(File f) throws FileNotFoundException, IOException, DocumentException {
PdfReader reader = new PdfReader(new FileInputStream(f));
//Transfers data from the template input file to temporary buffer
PdfStamper stamper = new PdfStamper(reader, templateBuffer);
stamper.setFormFlattening(true);
AcroFields form = stamper.getAcroFields();
//Get size of template file via looking for "end" Acrofield
float[] area = form.getFieldPositions("end");
float size = TEMPLATE_SIZE.getHeight() - MARGIN_TOP - area[4];
//Requires Page Break
if (size >= PAGE_SIZE.getHeight() - MARGIN_TOP - MARGIN_BOTTOM + currPosition) {
PdfReader subreader = new PdfReader(workingBuffer.toByteArray());
PdfStamper substamper = new PdfStamper(subreader, storageBuffer);
currPosition = 0;
currPage++;
substamper.insertPage(currPage, PAGE_SIZE);
substamper.close();
subreader.close();
workingBuffer = storageBuffer;
storageBuffer = new ByteArrayOutputStream();
}
//Set Fields
form.setField("field1", String.format("Form Text %d", formFillCount));
form.setField("page", String.format("Page %d", currPage));
formFillCount++;
stamper.close();
reader.close();
//Read from working buffer, stamp to storage buffer, stamp template from template buffer
reader = new PdfReader(workingBuffer.toByteArray());
stamper = new PdfStamper(reader, storageBuffer);
reader.close();
reader = new PdfReader(templateBuffer.toByteArray());
PdfImportedPage page = stamper.getImportedPage(reader, 1);
PdfContentByte cb = stamper.getOverContent(currPage);
cb.addTemplate(page, 0, currPosition);
stamper.close();
reader.close();
//Reset buffers - working buffer takes on storage buffer data, storage and template buffers clear
workingBuffer = storageBuffer;
storageBuffer = new ByteArrayOutputStream();
templateBuffer = new ByteArrayOutputStream();
currPosition -= size;
}
Running this program with a DEFAULT_NUMBER_OF_TIMES of 23 produces this document and causes the failure when sent to the printer. Changing it to 22 times produces this similar-looking document (simply with one less "line") which does not have the problem and prints successfully. Using a different PDF file as a template component completely changes these numbers or makes it so that it may not happen at all.
While this problem is likely too specific and has too many factors for other people to reasonably be expected to reproduce, the question of possibilities remains. What about the file generation could cause this unusual behavior? What might cause one file to be acceptable to a specific printer but another, generated in the same manner and differing only in seemingly trivial ways, to be unacceptable? Is there a bug in iText produced by using the stamper template commands too heavily? This has been a long-running bug for us, so any assistance is appreciated; additionally, I am willing to answer questions or have extended conversations in chat as necessary in an effort to get to the bottom of this.

The design of your application more or less abuses the otherwise perfectly fine PdfStamper functionality.
Allow me to explain.
The contents of a page can be expressed as a stream object or as an array of stream objects. When changing a page using PdfStamper, the content of this page is always an array of stream objects, consisting of the original stream object or the original array of stream objects to which extra elements are added.
By adding the same template while creating a new PdfStamper object over and over again, you increase the number of elements in the page contents array dramatically. You also introduce a huge number of q and Q operators that save and restore the graphics state stack. The reason why you have random behavior is clear: the memory and CPU available to process the PDF can vary from one moment to another. One time, there will be sufficient resources to deal with 20 q operators (each saves the state), the next time there will only be sufficient resources to deal with 19. The problem occurs when the process runs out of resources.
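The growth can be pictured with a toy model in plain Java. This is only a simplified sketch, not iText code: it assumes that every stamper pass wraps the page's existing content in one q ... Q pair and appends one stream for the new template. The exact streams iText writes differ in detail, and all names below are invented.

```java
import java.util.ArrayList;
import java.util.List;

public class StamperNestingModel {

    // One simulated stamper pass (assumption): wrap what is there in q ... Q, then append the overlay.
    public static List<String> stampOnce(List<String> content, String overlay) {
        List<String> out = new ArrayList<>();
        out.add("q");
        out.addAll(content);
        out.add("Q");
        out.add(overlay);
        return out;
    }

    // Starts from a single original stream and applies the given number of passes.
    public static List<String> simulate(int passes) {
        List<String> content = new ArrayList<>();
        content.add("original");
        for (int i = 1; i <= passes; i++) {
            content = stampOnce(content, "template" + i);
        }
        return content;
    }

    // Deepest level of q nesting a PDF processor would have to keep on its state stack.
    public static int maxDepth(List<String> content) {
        int depth = 0;
        int max = 0;
        for (String element : content) {
            if (element.equals("q")) {
                depth++;
                max = Math.max(max, depth);
            } else if (element.equals("Q")) {
                depth--;
            }
        }
        return max;
    }

    public static void main(String[] args) {
        List<String> page = simulate(23); // 23 passes, as in the question's failing case
        System.out.println("elements: " + page.size() + ", max q depth: " + maxDepth(page));
    }
}
```

In this model both the number of content elements and the nesting depth grow linearly with the number of passes, which is the pattern a single reused PdfStamper avoids.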
While the PDFs you're creating aren't illegal according to ISO-32000-1, some PDF processors simply choke on these PDFs. iText is a toolbox that allows you to create PDFs that can make me very happy when I look under the hood, but it also allows you to create horrible PDFs if you don't use the toolbox wisely. The latter is what happened in your case.
You should solve this by reusing the PdfStamper instance instead of creating a new PdfStamper over and over again. If that's not possible, please post another question, using fewer words, explaining exactly what you want to achieve.
Suppose that you have many different source files with PDF snippets that need to be added to a single page. For instance: suppose that each PDF snippet was a coupon and you need to create a sheet with 30 coupons. Then you'd use a single PdfWriter instance, import pages with getImportedPage() and add them at the correct position using addTemplate().
Of course: I have no idea what your project is about. The idea of coupons on a page was inspired by your test PDF.
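As a rough illustration of "the correct position" in the coupon example, the offsets could be computed as below. This is plain arithmetic, not iText code, and everything in it is an assumption: a 3 x 10 grid, a Letter page of 612 x 792 points, snippets already sized to fit one cell, and the usual PDF convention that the origin is the lower-left corner of the page. The resulting pair would be the x and y passed to addTemplate(page, x, y).

```java
public class CouponGrid {

    // Returns {x, y}: the lower-left corner of the cell for a 0-based coupon index,
    // filling the sheet row by row starting at the top-left.
    public static float[] offset(int index, int cols, int rows, float pageWidth, float pageHeight) {
        if (index < 0 || index >= cols * rows) {
            throw new IllegalArgumentException("index outside the grid");
        }
        float cellWidth = pageWidth / cols;
        float cellHeight = pageHeight / rows;
        int col = index % cols;
        int row = index / cols;
        float x = col * cellWidth;
        // y is measured from the bottom of the page, so the first row sits near the top.
        float y = pageHeight - (row + 1) * cellHeight;
        return new float[] { x, y };
    }

    public static void main(String[] args) {
        // Hypothetical sheet: 30 coupons, 3 columns by 10 rows, on a Letter page.
        for (int i = 0; i < 30; i++) {
            float[] xy = offset(i, 3, 10, 612f, 792f);
            System.out.println("coupon " + i + " -> x=" + xy[0] + ", y=" + xy[1]);
        }
    }
}
```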

iText mergeFields in PdfCopy creates invalid pdf

I am working on the task of merging some input PDF documents using iText 5.4.5. The input documents may or may not contain AcroForms and I want to merge the forms as well.
I am using the example pdf files found here and this is the code example:
public class TestForms {
@Test
public void testNoForms() throws DocumentException, IOException {
test("pdf/hello.pdf", "pdf/hello_memory.pdf");
}
@Test
public void testForms() throws DocumentException, IOException {
test("pdf/subscribe.pdf", "pdf/filled_form_1.pdf");
}
private void test(String first, String second) throws DocumentException, IOException {
OutputStream out = new FileOutputStream("/tmp/out.pdf");
InputStream stream = getClass().getClassLoader().getResourceAsStream(first);
PdfReader reader = new PdfReader(new RandomAccessFileOrArray(
new RandomAccessSourceFactory().createSource(stream)), null);
InputStream stream2 = getClass().getClassLoader().getResourceAsStream(second);
PdfReader reader2 = new PdfReader(new RandomAccessFileOrArray(
new RandomAccessSourceFactory().createSource(stream2)), null);
Document pdfDocument = new Document(reader.getPageSizeWithRotation(1));
PdfCopy pdfCopy = new PdfCopy(pdfDocument, out);
pdfCopy.setFullCompression();
pdfCopy.setCompressionLevel(PdfStream.BEST_COMPRESSION);
pdfCopy.setMergeFields();
pdfDocument.open();
pdfCopy.addDocument(reader);
pdfCopy.addDocument(reader2);
pdfCopy.close();
reader.close();
reader2.close();
}
}
With input files containing forms I get a NullPointerException with or without compression enabled.
With standard input docs, the output file is created but when I open it with Acrobat it says there was a problem (14) and no content is displayed.
With standard input docs AND compression disabled the output is created and Acrobat displays it.
Questions
I previously did this using PdfCopyFields but it's now deprecated in favor of the boolean flag mergeFields in the PdfCopy, is this correct? There's no javadoc on that flag and I couldn't find documentation about it.
Assuming the answer to the previous question is Yes, is there anything wrong with my code?
Thanks

We are using PdfCopy to merge different files, some of which may have fields. We use version 5.5.3.0. The code is simple and it seems to work fine, BUT sometimes the resulting file is impossible to print!
Our code :
Public Shared Function MergeFiles(ByVal sourceFiles As List(Of Byte())) As Byte()
Dim document As New Document()
Dim output As New MemoryStream()
Dim copy As iTextSharp.text.pdf.PdfCopy = Nothing
Dim readers As New List(Of iTextSharp.text.pdf.PdfReader)
Try
copy = New iTextSharp.text.pdf.PdfCopy(document, output)
copy.SetMergeFields()
document.Open()
For fileCounter As Integer = 0 To sourceFiles.Count - 1
Dim reader As New PdfReader(sourceFiles(fileCounter))
reader.MakeRemoteNamedDestinationsLocal()
readers.Add(reader)
copy.AddDocument(reader)
Next
Catch exception As Exception
Throw exception
Finally
If copy IsNot Nothing Then copy.Close()
document.Close()
For Each reader As PdfReader In readers
reader.Close()
Next
End Try
Return output.GetBuffer()
End Function

Your usage of PdfCopy.setMergeFields() is correct and your merging code is fine.
The issues you described are because of bugs that have crept into 5.4.5. They should be fixed in rev. 6152 and the fixes will be included in the next release.
Thanks for bringing this to our attention.

It's just to say that we have the same problem: iText mergeFields in PdfCopy creates an invalid PDF. So it is still not fixed in version 5.5.3.0.
