Huge result PDF from iText using PdfSmartCopy - java

I have (in most cases) two PDF files: one main PDF containing around 30,000 pages, and another containing a single page that I want to insert into the main one after every X pages (depending on my separate index file), while also adding barcodes to each page.
The problem I have is that the result PDF becomes very large (10 GB+), while the input files are 350 MB and the small one I want to insert is under 50 KB.
What's a good way to optimize the size of the PDF I'm creating?
Here are the relevant parts of the code handling the PDF merge:
PdfImportedPage page;
PdfSmartCopy outPdf;
PdfSmartCopy.PageStamp stamp;
PdfReader pdfReader, pdfInsertReader;
...
outDoc = new Document();
outPdf = new PdfSmartCopy(outDoc, new FileOutputStream(outFile));
pdfToolset = new PDFToolset();
outDoc.open();
...
// loop over the pages in my index file
for (IndexPage indexpage : item.pages) {
    if (indexpage.insertPage) {
        currentDoc = indexpage.source_file;
        outPdf.freeReader(pdfReader);
        outPdf.flush();
        if (!currentDoc.equals(insertDoc)) {
            insertDoc = currentDoc;
            pdfInsertReader = new PdfReader(currentDoc);
        }
    } else if (!currentDoc.equals(indexpage.source_file)) {
        currentDoc = indexpage.source_file;
        outPdf.freeReader(pdfInsertReader);
        outPdf.flush();
        if (!mainDoc.equals(currentDoc)) {
            mainDoc = currentDoc;
            pdfReader = new PdfReader(mainDoc);
        }
    }
    if (indexpage.insertPage)
        page = outPdf.getImportedPage(pdfInsertReader, indexpage.source_page);
    else
        page = outPdf.getImportedPage(pdfReader, indexpage.source_page);
    if (!duplex || indexpage.nr % 2 == 1) { // in duplex mode, only stamp odd page numbers
        stamp = outPdf.createPageStamp(page);
        stamp = pdfToolset.applyBarcode(stamp, indexpage.omr, indexpage.nr);
        stamp.alterContents();
    }
    outPdf.addPage(page);
}
...
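Not part of the original question, but a useful baseline for comparison: PdfSmartCopy can only share resources it sees through a live reader, and freeReader() releases a reader's data. A minimal sketch, assuming iText 5.x package names, in which each source is opened exactly once and the insert page is imported once and reused for every insertion:

import java.io.FileOutputStream;
import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.PdfImportedPage;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfSmartCopy;

public class ReuseInsertPage {
    public static void main(String[] args) throws Exception {
        Document outDoc = new Document();
        PdfSmartCopy outPdf = new PdfSmartCopy(outDoc, new FileOutputStream("out.pdf"));
        outDoc.open();

        // Open each source exactly once; never freeReader() a reader
        // you will import from again.
        PdfReader mainReader = new PdfReader("main.pdf");
        PdfReader insertReader = new PdfReader("insert.pdf");
        PdfImportedPage insertPage = outPdf.getImportedPage(insertReader, 1);

        int x = 10; // placeholder for "every X pages" from the index file
        for (int i = 1; i <= mainReader.getNumberOfPages(); i++) {
            outPdf.addPage(outPdf.getImportedPage(mainReader, i));
            if (i % x == 0) {
                outPdf.addPage(insertPage); // same imported page object each time
            }
        }
        outDoc.close();
        mainReader.close();
        insertReader.close();
    }
}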

Related

Compress PDF after merge, Kotlin Java Spring Boot

I have a project that splits a PDF file uploaded by the user; after the split it reads the content of each page and then merges pages back together based on that content, using PDDocument for the split and PDFMergerUtility for the merge. After merging, I save the merged PDF to the database as a byte array.
After saving to the DB, the user can also download the split-and-merged PDF and re-upload it when needed.
But I have found a problem: after the merge, the PDF is bigger than the PDF before the split.
I have tried to find a solution, but nothing I found works for my problem, such as:
Android PdfDocument file size
Is there a way to compress PDF to small size using Java?
and other solutions.
Is there any solution to my problem? I would be glad for any help.
Here is my code:
// file: MultipartFile -> file is sent from the front end via the API
val inpStream: InputStream = file.getInputStream()
pdfDocument = PDDocument.load(inpStream)

val n = pdfDocument.numberOfPages
val batchSize = 200
val finalBatchSize: Int = n % batchSize
val numOfBatch: Int = (n - finalBatchSize) / batchSize
val batchFinal: Int = if (finalBatchSize == 0) numOfBatch else numOfBatch + 1
var batchNo = 1
var startPage: Int
var endPage = 0
while (batchNo <= batchFinal) {
    startPage = endPage + 1
    endPage += if (batchNo > numOfBatch) finalBatchSize else batchSize
    val splitter = Splitter()
    splitter.setStartPage(startPage)
    splitter.setEndPage(endPage)
    // splitting the pages of a PDF document
    pagesPdf = splitter.split(pdfDocument)
    batchNo++
    var i = startPage
    var groupPage: Int = i
    var pageNo = 0
    var pdfMerger = PDFMergerUtility()
    var mergedFileByteArrOut = ByteArrayOutputStream()
    pdfMerger.setDestinationStream(mergedFileByteArrOut)
    var fileObj: ByteArray? = null
    for (pd in pagesPdf) {
        pageNo++
        if (!pd.isEncrypted) {
            val stripper = PDFTextStripper()
            // CODE TO GET CONTENT
            if (condition1 == true) {
                val fileByteArrOut = ByteArrayOutputStream()
                pd.save(fileByteArrOut)
                pd.close()
                val fileByteArrIn = ByteArrayInputStream(fileByteArrOut.toByteArray())
                pdfMerger.addSource(fileByteArrIn)
                fileObj = fileByteArrOut.toByteArray()
            }
            if (condition2 == true) {
                // I want to compress fileObj first before saving to the DB
                // code to save to DB
                fileObj = null
                pdfMerger = PDFMergerUtility()
                mergedFileByteArrOut = ByteArrayOutputStream()
                pdfMerger.setDestinationStream(mergedFileByteArrOut)
            }
        }
    }
}
You can use cpdf https://community.coherentpdf.com to losslessly squeeze the PDF files afterward. This will reconcile identical objects and common parts, and remove any unneeded parts.
From the command line
cpdf -squeeze in.pdf -o out.pdf
Or, from Java:
jcpdf.squeezeInMemory(pdf);
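A slightly fuller sketch of the Java route: squeezeInMemory is the call shown above, but the import and the fromFile/toFile helpers are my assumptions based on the cpdf documentation, so check them against your jcpdf version:

import com.coherentpdf.Jcpdf;

public class SqueezePdf {
    public static void main(String[] args) throws Exception {
        Jcpdf jcpdf = new Jcpdf();
        // Load, squeeze losslessly in memory, and write back out.
        Jcpdf.Pdf pdf = jcpdf.fromFile("in.pdf", ""); // "" = no user password
        jcpdf.squeezeInMemory(pdf);
        jcpdf.toFile(pdf, "out.pdf", false, false);   // no linearization, no new /ID
    }
}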

Remove links from a PDF using iText 7.1

We have a vendor that will not accept PDFs that contain links. We are trying to remove the links by removing all link annotations from each page of the PDF using iText 7.1 (Java). We have tried multiple techniques based on research. Here are three examples of attempts to detect and remove the links. None of these result in the destination PDF (test-no-links.pdf) having the links removed. Any insight would be greatly appreciated.
Example 1: Remove based on class type of annotation
String src = "test-with-links.pdf";
String dest = "test-no-links.pdf";
PdfReader reader = new PdfReader(src);
PdfWriter writer = new PdfWriter(dest);
PdfDocument pdfDoc = new PdfDocument(reader,writer);
for( int page = 1; page <= pdfDoc.getNumberOfPages(); ++page ) {
PdfPage pdfPage = pdfDoc.getPage(page);
List<PdfAnnotation> annots = pdfPage.getAnnotations();
if ((annots == null) || (annots.size() == 0)) {
System.out.println("no annotations on page " + page);
}
else {
for( PdfAnnotation annot : annots ) {
if( annot instanceof PdfLinkAnnotation ) {
pdfPage.removeAnnotation(annot);
}
}
}
}
pdfDoc.close();
Example 2: Remove based on annotation subtype value
String src = "test-with-links.pdf";
String dest = "test-no-links.pdf";
PdfReader reader = new PdfReader(src);
PdfWriter writer = new PdfWriter(dest);
PdfDocument pdfDoc = new PdfDocument(reader, writer);
for (int page = 1; page <= pdfDoc.getNumberOfPages(); ++page) {
    PdfPage pdfPage = pdfDoc.getPage(page);
    List<PdfAnnotation> annots = pdfPage.getAnnotations();
    if ((annots == null) || (annots.size() == 0)) {
        System.out.println("no annotations on page " + page);
    } else {
        for (PdfAnnotation annot : annots) {
            // if this annotation has a link, delete it
            if (annot.getSubtype().equals(PdfName.Link)) {
                PdfDictionary annotAction = ((PdfLinkAnnotation) annot).getAction();
                if (annotAction.get(PdfName.S).equals(PdfName.URI) ||
                        annotAction.get(PdfName.S).equals(PdfName.GoToR)) {
                    PdfString uri = annotAction.getAsString(PdfName.URI);
                    System.out.println("Removing " + uri.toString());
                    pdfPage.removeAnnotation(annot);
                }
            }
        }
    }
}
pdfDoc.close();
Example 3: Remove all annotations (ignore annotation type)
String src = "test-with-links.pdf";
String dest = "test-no-links.pdf";
PdfReader reader = new PdfReader(src);
PdfWriter writer = new PdfWriter(dest);
PdfDocument pdfDoc = new PdfDocument(reader, writer);
for (int page = 1; page <= pdfDoc.getNumberOfPages(); ++page) {
    PdfPage pdfPage = pdfDoc.getPage(page);
    // remove all annotations from the page regardless of type
    pdfPage.getPdfObject().remove(PdfName.Annots);
}
pdfDoc.close();
Each of your tests generates a PDF without Link annotations.
Probably, though, your PDF viewer recognizes "www.qualpay.com" as a (partial) URL and displays it as a link.
In detail
Your routines
All your tests successfully remove all Link annotations from your sample PDF, cf. these screen shots for the source and all three result files, in particular look for the page 1 Annots entry:
test-with-links.pdf
test-no-links.pdf
test-no-links-1.pdf
test-no-links-2.pdf
The viewer
Indeed, though, when viewing the PDF in Adobe Acrobat Reader (and also some other viewers, e.g. the built-in PDF viewers of Chrome and Edge), you'll see that "www.qualpay.com" is treated like a link.
The cause is that this is a feature of the PDF viewer! It scans the text of the PDF it displays for strings it recognizes as (a part of) some URL and displays them like links!
In Adobe Acrobat Reader you can disable this feature:
If you disable "Create links from URLs", you'll suddenly find the URLs in your result files inactive while the URL in your source file (with the link annotation) is still active.
What to do
We have a vendor that will not accept PDFs that contain links.
First discuss with your vendor what exactly he means by "PDFs that contain links". Does he mean
PDFs with Link annotations or
PDFs with URLs that common PDF viewers present like Link annotations.
In the former case you're done, your code (either variant) removes the link annotations. You may have to demonstrate to the vendor how to disable the URL recognition in Adobe Acrobat Reader, though.
In the latter case you'll have to remove everything from the text content of your PDFs that common PDF viewers recognize as URLs. You may replace each URL by a bitmap image of the URL text, or the URL text drawn like a generic vector graphic (defining a path of lines and curves and filling that), or some similar surrogate.
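If the latter is what the vendor means, one possible route (my addition, not part of the original answer; it requires the separately licensed pdfSweep add-on, and the autosweep API shown here is an assumption matching older pdfSweep 2.x releases) is to redact URL-shaped text automatically:

import com.itextpdf.kernel.colors.ColorConstants;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.pdfcleanup.autosweep.CompositeCleanupStrategy;
import com.itextpdf.pdfcleanup.autosweep.PdfAutoSweep;
import com.itextpdf.pdfcleanup.autosweep.RegexBasedCleanupStrategy;

public class RedactUrls {
    public static void main(String[] args) throws Exception {
        PdfDocument pdfDoc = new PdfDocument(
                new PdfReader("test-with-links.pdf"),
                new PdfWriter("test-no-urls.pdf"));
        // Redact anything that looks like a URL; this regex is a rough
        // placeholder, not a complete URL matcher.
        CompositeCleanupStrategy strategy = new CompositeCleanupStrategy();
        strategy.add(new RegexBasedCleanupStrategy("(https?://|www\\.)\\S+")
                .setRedactionColor(ColorConstants.WHITE));
        new PdfAutoSweep(strategy).cleanUp(pdfDoc);
        pdfDoc.close();
    }
}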

How to create bookmarks (table of contents) from headings with iText?

I create a PDF from a list of HTML strings, and it generates the PDF very well. But I want to add a table of contents to the PDF, created from the h1, h2, etc. headings. How can I do that? Below is my function to create the PDF. I looked at the existing examples on the iText site, but I couldn't make it work the way I want.
public static void createMultiplePagedPdf(String destinationFile, ArrayList<String> htmlStrings,
        String cssLocation, HeaderFooter headerFooter, boolean tableOfContents) {
    String css = null;
    ElementList list = null;
    Document document = new Document();
    try {
        PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(destinationFile));
        if (headerFooter != null)
            writer.setPageEvent(headerFooter);
        TableOfContent tocEvent = new TableOfContent();
        writer.setPageEvent(tocEvent);
        document.open();
        if (tableOfContents) {
            tocEvent.setRoot(writer.getRootOutline());
            for (TOCEntry entry : tocEvent.getToc()) {
                Chunk c = new Chunk(entry.title);
                c.setAction(entry.action);
                document.add(new Paragraph(c));
            }
        }
        if (cssLocation != null)
            css = readCSS(cssLocation);
        for (String htmlfile : htmlStrings) {
            if (css != null)
                list = XMLWorkerHelper.parseToElementList(htmlfile, css);
            else
                list = XMLWorkerHelper.parseToElementList(htmlfile, null);
            for (Element e : list) {
                document.add(e);
            }
        }
        System.out.println("Pdf Created successfully");
        document.close();
    } catch ...
}
tocEvent.getToc() returns an empty list, and moving that if statement to the end of the code makes no difference.
My TOCEntry and TableOfContent classes are the same as those in the first example of Creating Table of Contents using events using iText 5.
Thanks in advance!
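For context (my sketch, not an answer from the thread): with the event-based approach, tocEvent.getToc() is only filled while the content pages are being written, so the TOC block has to run after the content loop; the pages can then be reordered so the TOC ends up in front. A rough fragment for the inside of the question's try block, assuming iText 5's setLinearPageMode()/reorderPages() as used in the cited example:

// Must be called before any page is written, otherwise reorderPages() fails.
writer.setLinearPageMode();
document.open();

// 1) Add all the HTML content first; the TableOfContent page event
//    collects a TOCEntry per heading while these pages are written.
// ... (the htmlStrings loop from the question goes here) ...

// 2) Emit the TOC on fresh pages at the end of the document.
document.newPage();
int toc = writer.getPageNumber(); // first page of the TOC
for (TOCEntry entry : tocEvent.getToc()) {
    Chunk c = new Chunk(entry.title);
    c.setAction(entry.action);
    document.add(new Paragraph(c));
}

// 3) Rotate the page order so the TOC pages come first.
int total = writer.reorderPages(null); // null = just return the page count
int[] order = new int[total];
for (int i = 0; i < total; i++) {
    order[i] = i + toc;
    if (order[i] > total) {
        order[i] -= total;
    }
}
writer.reorderPages(order);
document.close();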

Get size (in bytes) of a specific page in a PDF using iText

I'm using iText (v 2.1.7) and I need to find the size, in bytes, of a specific page.
I've written the following code:
public static long[] getPageSizes(byte[] input) throws IOException {
    PdfReader reader = new PdfReader(input);
    int pageCount = reader.getNumberOfPages();
    long[] pageSizes = new long[pageCount];
    for (int i = 0; i < pageCount; i++) {
        pageSizes[i] = reader.getPageContent(i + 1).length;
    }
    reader.close();
    return pageSizes;
}
But it doesn't work properly. The reader.getPageContent(i+1).length instruction returns very small values (usually <= 100), even for large pages of more than 1 MB, so this is clearly not the correct way to do it. (Presumably that call returns only the page's content stream, not the resources, such as images and fonts, that the page references.)
But what IS the correct way? Is there one?
Note: I've already checked this question, but the offered solution consists of writing each page of the PDF to disk and then checking the file size, which is extremely inefficient and may even be wrong, since I'm assuming this would repeat the PDF header and metadata each time. I was searching for a more "proper" solution.
Well, in the end I managed to get hold of the source code of the original program I was working with, which only accepted PDFs as input with a maximum "page size" of 1 MB. Turns out... what it actually meant by "page size" was fileSize / pageCount -_-^
For anyone who actually needs the precise size of a "standalone" page, with all content included, I've tested this solution and it seems to work well, though it probably isn't very efficient, as it writes out an entire PDF document for each page. Using a memory stream instead of a disk-based one helps, but I don't know by how much.
public static int[] getPageSizes(byte[] input) throws IOException {
    PdfReader reader = new PdfReader(input);
    int pageCount = reader.getNumberOfPages();
    int[] pageSizes = new int[pageCount];
    for (int i = 0; i < pageCount; i++) {
        try {
            // write page i+1 into a standalone single-page PDF in memory
            Document doc = new Document();
            ByteArrayOutputStream bous = new ByteArrayOutputStream();
            PdfCopy copy = new PdfCopy(doc, bous);
            doc.open();
            PdfImportedPage page = copy.getImportedPage(reader, i + 1);
            copy.addPage(page);
            doc.close();
            pageSizes[i] = bous.size(); // size of the standalone PDF for this page
        } catch (DocumentException e) {
            e.printStackTrace();
        }
    }
    reader.close();
    return pageSizes;
}
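A small usage sketch for the method above (my addition; input.pdf is a placeholder):

import java.nio.file.Files;
import java.nio.file.Paths;

byte[] input = Files.readAllBytes(Paths.get("input.pdf"));
int[] sizes = getPageSizes(input);
for (int i = 0; i < sizes.length; i++) {
    System.out.println("page " + (i + 1) + ": " + sizes[i] + " bytes");
}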

Splitting one Pdf file to multiple according to the file-size

I have been trying to split one big PDF file into multiple PDF files based on its size. I was able to split it, but it only creates a single file and the rest of the data is lost; it does not create more than one file. Can anyone please help? Here is my code.
public static void main(String[] args) {
    try {
        PdfReader Split_PDF_By_Size = new PdfReader("C:\\Temp_Workspace\\TestZip\\input1.pdf");
        Document document = new Document();
        PdfCopy copy = new PdfCopy(document, new FileOutputStream("C:\\Temp_Workspace\\TestZip\\File1.pdf"));
        document.open();
        int number_of_pages = Split_PDF_By_Size.getNumberOfPages();
        int pagenumber = 1; /* To generate file name dynamically */
        // int Find_PDF_Size; /* To get PDF size in bytes */
        float combinedsize = 0; /* To convert this to kilobytes and estimate new PDF size */
        for (int i = 1; i < number_of_pages; i++) {
            float Find_PDF_Size;
            if (combinedsize == 0 && i != 1) {
                document = new Document();
                pagenumber++;
                String FileName = "File" + pagenumber + ".pdf";
                copy = new PdfCopy(document, new FileOutputStream(FileName));
                document.open();
            }
            copy.addPage(copy.getImportedPage(Split_PDF_By_Size, i));
            Find_PDF_Size = copy.getCurrentDocumentSize();
            combinedsize = (float) Find_PDF_Size / 1024;
            if (combinedsize > 496 || i == number_of_pages) {
                document.close();
                combinedsize = 0;
            }
        }
        System.out.println("PDF Split By Size Completed. Number of Documents Created: " + pagenumber);
    } catch (Exception i) {
        System.out.println(i);
    }
}
(BTW, it would have been great if you had tagged your question with itext, too.)
PdfCopy used to close the PdfReaders it imported pages from whenever the source PdfReader for page imports switched or the PdfCopy was closed. This was due to the original intended use case to create one target PDF from multiple source PDFs in combination with the fact that many users forget to close their PdfReaders.
Thus, after you close the first target PdfCopy, the PdfReader is closed, too, and no further pages are extracted.
If I interpret the most recent checkins into the iText SVN repository correctly, this implicit closing of PdfReaders is in the process of being removed from the code. Therefore, with one of the next iText versions, your code may work as intended.
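Until a version with that change ships, a possible workaround (my sketch, not from the answer; iText 5 package names assumed, and the paths and 496 KB threshold are taken from the question) is to re-open the PdfReader whenever closing a finished chunk has implicitly closed it:

import java.io.FileOutputStream;
import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfReader;

public class SplitBySize {
    public static void main(String[] args) throws Exception {
        String src = "C:\\Temp_Workspace\\TestZip\\input1.pdf";
        PdfReader reader = new PdfReader(src);
        int pages = reader.getNumberOfPages();
        int fileNo = 1;
        Document document = new Document();
        PdfCopy copy = new PdfCopy(document, new FileOutputStream("File" + fileNo + ".pdf"));
        document.open();
        for (int i = 1; i <= pages; i++) {
            copy.addPage(copy.getImportedPage(reader, i));
            // start a new file once the current one exceeds ~496 KB
            if (copy.getCurrentDocumentSize() > 496 * 1024 && i < pages) {
                document.close();            // this also closes 'reader' (see above)
                reader = new PdfReader(src); // so re-open it for the next chunk
                fileNo++;
                document = new Document();
                copy = new PdfCopy(document, new FileOutputStream("File" + fileNo + ".pdf"));
                document.open();
            }
        }
        document.close();
        System.out.println("Created " + fileNo + " documents.");
    }
}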
