Heap space issue while merging documents using PDFBox - Java

I am getting a java.lang.OutOfMemoryError when I try to merge one 44k-page PDF. I fetch all 44k pages from my DB in chunks and merge them into my main document. It processes fine up to about 9.5k pages and then starts throwing the heap space error.
public void getDocumentAsPdf(String docid) {
    PDDocument pdDocument = new PDDocument();
    try {
        // fetching total page count from DB
        Long totalPages = countByDocument(docid);
        Integer batchSize = 400;
        Integer skip = 0;
        Long totalBatches = totalPages / batchSize;
        Long remainingPages = totalPages % batchSize;
        for (int i = 1; i <= totalBatches; i++) {
            log.info("Batch : {}", i);
            // fetching pages of the given document in ascending order from the database
            List<Page> documentPages = fetchPagesByDocument(docid, batchSize, skip);
            pdDocument = mergePagesToDocument(pdDocument, documentPages);
            skip += batchSize;
        }
        if (remainingPages > 0) {
            // fetching the remaining pages of the given document in ascending order from the database
            List<Page> documentPages = fetchPagesByDocument(docid, batchSize, skip);
            pdDocument = mergePagesToDocument(pdDocument, documentPages);
        }
    } catch (Exception e) {
        throw new InternalErrorException("500", "Exception occurred while merging!");
    }
}
Merge PDF logic:
public PDDocument mergePagesToDocument(PDDocument pdDocument, List<Page> documentPages) {
    try {
        PDFMergerUtility pdfMergerUtility = new PDFMergerUtility();
        pdfMergerUtility.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());
        for (Page page : documentPages) {
            byte[] decodedPage = java.util.Base64.getDecoder().decode(page.getPageData());
            PDDocument addPage = PDDocument.load(decodedPage);
            pdfMergerUtility.appendDocument(pdDocument, addPage);
            addPage.close();
        }
        return pdDocument;
    } catch (Exception e) {
        throw new InternalErrorException("500", e.getMessage());
    }
}
I think there is some memory leak on my side that is causing this issue. Any suggestion or a better approach would be helpful. Thanks in advance!

It isn't exactly a memory leak, but you are trying to hold the whole 44k-page PDF in the pdDocument variable. It might be bigger than your heap size. You can increase the heap with the VM option -Xmx.
Alternatively, you can change your approach so that the whole 44k-page document is never held in memory at once.
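For example, a minimal sketch of that second approach using PDFBox's own PDFMergerUtility: each decoded page is spooled to a temporary file and the merge writes straight to the destination file with temp-file-backed memory, so the growing document never has to fit in the heap. The countByDocument/fetchPagesByDocument helpers and the Page entity are the question's own (stubbed here as abstract members); the batch size and temp-file handling are placeholder choices.
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;

public abstract class StreamingPdfMerge {

    // Stand-in for the question's Page entity.
    public interface Page {
        String getPageData();
    }

    // The question's DB helpers, assumed to exist elsewhere.
    protected abstract long countByDocument(String docid);
    protected abstract List<Page> fetchPagesByDocument(String docid, int batchSize, int skip);

    public void writeDocumentAsPdf(String docid, File target) throws IOException {
        PDFMergerUtility merger = new PDFMergerUtility();
        List<File> tempFiles = new ArrayList<>();
        try {
            long totalPages = countByDocument(docid);
            int batchSize = 400;
            int skip = 0;
            while (skip < totalPages) {
                // Spool each decoded page to a temp file instead of appending it
                // to an ever-growing in-memory PDDocument.
                for (Page page : fetchPagesByDocument(docid, batchSize, skip)) {
                    File tmp = Files.createTempFile("page-", ".pdf").toFile();
                    Files.write(tmp.toPath(), Base64.getDecoder().decode(page.getPageData()));
                    merger.addSource(tmp);
                    tempFiles.add(tmp);
                }
                skip += batchSize;
            }
            merger.setDestinationFileName(target.getAbsolutePath());
            // Scratch-file-backed merge: PDF objects spill to a temp file
            // instead of accumulating on the heap.
            merger.mergeDocuments(MemoryUsageSetting.setupTempFileOnly());
        } finally {
            for (File tmp : tempFiles) {
                tmp.delete();
            }
        }
    }
}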

Related

Does jPod Merge PDFs by data streaming?

I am using jPod to merge my PDF documents. I merged 400 PDFs of 20 pages each, resulting in a file of 190 MB, whereas the size of a single PDF is 38 KB. I checked the heap status in my IDE and did not get any OutOfMemoryError. I ran the same code in Apache Tomcat with almost 30 clients, and my Tomcat stopped serving requests. Is it because jPod doesn't use streaming, or due to some other reason?
private void run() throws Throwable {
    String sOutFileFullPathAndName = "/Users/test/Downloads/" + UUID.randomUUID().toString().replace("-", "");
    PDDocument dstDocument = PDDocument.createNew();
    for (int i = 0; i < 400; i++) {
        // System.out.println(Runtime.getRuntime().freeMemory());
        PDDocument srcDocument = PDDocument.createFromLocator(new FileLocator("/Users/test/Downloads/2.pdf"));
        mergeDocuments(dstDocument, srcDocument);
    }
    FileLocator destinationLocator = new FileLocator(sOutFileFullPathAndName);
    dstDocument.save(destinationLocator, null);
    dstDocument.close();
}
private void mergeDocuments(PDDocument dstDocument, PDDocument srcDocument) {
    PDPageTree pageTree = srcDocument.getPageTree();
    int pageCount = pageTree.getCount();
    for (int index = 0; index < pageCount; index++) {
        PDPage srcPage = pageTree.getPageAt(index);
        appendPage(dstDocument, srcPage);
        srcPage = null;
    }
}

private void appendPage(PDDocument document, PDPage page) {
    PDResources srcResources = page.getResources();
    CSContent cSContent = page.getContentStream();
    PDPage newPage = (PDPage) PDPage.META.createNew();
    // copy resources from the source page to the newly created page
    PDResources newResources = (PDResources) PDResources.META
            .createFromCos(srcResources.cosGetObject().copyDeep());
    newPage.setResources(newResources);
    newPage.setContentStream(cSContent);
    // add that new page to the destination document
    document.addPageNode(newPage);
}
PDF is not simply a "stream" of page data. It is a complex data structure containing objects referencing each other; in this concrete case, page trees/nodes, content streams, resources, and so on.
jPod keeps persistent objects in memory using weak references only - they can always be refreshed from the random-access data. If you start updating the object structure, objects get "locked" in memory, simply because the change is not yet persisted and the object can no longer be refreshed.
Making lots of changes without periodically saving the result will keep the complete structure in memory - I assume that's your problem here. Saving every now and then should reduce the memory footprint.
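For illustration only (this is not part of the original answer): a variant of the question's run() method that saves the destination document every few merges, using only the jPod calls already shown above. The SAVE_EVERY interval is an arbitrary choice, and it assumes the document can be re-saved to the same FileLocator repeatedly; imports and mergeDocuments(...) are the same as in the question's code.
private static final int SAVE_EVERY = 50;

private void runWithPeriodicSave() throws Throwable {
    String outName = "/Users/test/Downloads/" + UUID.randomUUID().toString().replace("-", "");
    FileLocator destinationLocator = new FileLocator(outName);
    PDDocument dstDocument = PDDocument.createNew();
    for (int i = 0; i < 400; i++) {
        PDDocument srcDocument = PDDocument.createFromLocator(new FileLocator("/Users/test/Downloads/2.pdf"));
        mergeDocuments(dstDocument, srcDocument);
        // The original loop never closes the source document; do it here.
        srcDocument.close();
        if ((i + 1) % SAVE_EVERY == 0) {
            // Persist what has been merged so far; unchanged objects can then
            // be released and re-read from the saved data when needed.
            dstDocument.save(destinationLocator, null);
        }
    }
    dstDocument.save(destinationLocator, null);
    dstDocument.close();
}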
In addition, this algorithm will create a poor page tree, containing a flat linear array with thousands of pages. You should try to create a balanced tree structure. Another point for optimization is resource handling: merging resources such as fonts or images may dramatically reduce the target size.

Get size (in bytes) of a specific page in a PDF using iText

I'm using iText (v 2.1.7) and I need to find the size, in bytes, of a specific page.
I've written the following code:
public static long[] getPageSizes(byte[] input) throws IOException {
    PdfReader reader = new PdfReader(input);
    int pageCount = reader.getNumberOfPages();
    long[] pageSizes = new long[pageCount];
    for (int i = 0; i < pageCount; i++) {
        pageSizes[i] = reader.getPageContent(i + 1).length;
    }
    reader.close();
    return pageSizes;
}
But it doesn't work properly. The reader.getPageContent(i+1).length; instruction returns very small values (<= 100 usually), even for large pages that are more than 1MB, so clearly this is not the correct way to do this.
But what IS the correct way? Is there one?
Note: I've already checked this question, but the offered solution consists of writing each page of the PDF to disk and then checking the file size, which is extremely inefficient and may even be wrong, since I'm assuming this would repeat the PDF header and metadata each time. I was searching for a more "proper" solution.
Well, in the end I managed to get hold of the source code for the original program that I was working with, which only accepted PDFs as input with a maximum "page size" of 1 MB. Turns out... what it actually meant by "page size" was fileSize / pageCount -_-
For anyone who actually needs the precise size of a "standalone" page, with all content included, I've tested this solution and it seems to work well, though it probably isn't very efficient, as it writes out an entire PDF document for each page. Using a memory stream instead of a disk-based one helps, but I don't know by how much.
public static int[] getPageSizes(byte[] input) throws IOException {
    PdfReader reader = new PdfReader(input);
    int pageCount = reader.getNumberOfPages();
    int[] pageSizes = new int[pageCount];
    for (int i = 0; i < pageCount; i++) {
        try {
            Document doc = new Document();
            ByteArrayOutputStream bous = new ByteArrayOutputStream();
            PdfCopy copy = new PdfCopy(doc, bous);
            doc.open();
            PdfImportedPage page = copy.getImportedPage(reader, i + 1);
            copy.addPage(page);
            doc.close();
            pageSizes[i] = bous.size();
        } catch (DocumentException e) {
            e.printStackTrace();
        }
    }
    reader.close();
    return pageSizes;
}

Apache PDFBOX - getting java.lang.OutOfMemoryError when using split(PDDocument document)

I am trying to split a document of a decent 300 pages using the Apache PDFBox API v2.0.2.
While trying to split the PDF file into single pages using the following code:
PDDocument document = PDDocument.load(inputFile);
Splitter splitter = new Splitter();
List<PDDocument> splittedDocuments = splitter.split(document); //Exception happens here
I receive the following exception
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
This indicates that the GC is spending an excessive amount of time clearing the heap relative to the amount of memory it reclaims.
There are numerous JVM tuning options that could work around the situation; however, all of these just treat the symptom rather than the real issue.
One final note: I am using JDK 6, so the Java 8 Consumer approach is not an option in my case. Thanks.
Edit:
This is not a duplicate question of http://stackoverflow.com/questions/37771252/splitting-a-pdf-results-in-very-large-pdf-documents-with-pdfbox-2-0-2 as:
1. I do not have the size problem mentioned in the aforementioned topic. I am slicing a 270-page, 13.8 MB PDF file, and after slicing, each slice averages 80 KB, with a total size of 30.7 MB.
2. The split throws the exception even before it returns the split parts.
I found that the split can succeed as long as I do not pass the whole document at once; instead, I pass it in batches of 20-30 pages each, which does the job.
PDFBox stores the parts resulting from the split operation as PDDocument objects in the heap, which fills the heap quickly, and even if you call close() after every round of the loop, the GC will not be able to reclaim the heap as fast as it gets filled.
One option is to divide the split operation into batches, where each batch is a relatively manageable chunk (10 to 40 pages):
public void execute() {
    File inputFile = new File("path/to/the/file.pdf");
    PDDocument document = null;
    try {
        document = PDDocument.load(inputFile);
        int totalPages = document.getNumberOfPages();
        int batchSize = 50;
        int noOfBatches = totalPages / batchSize;
        int finalBatchSize = totalPages % batchSize;
        int start;
        int end = 0;
        for (int i = 1; i <= noOfBatches; i++) {
            start = end + 1;
            end = start + batchSize - 1;
            System.out.println("Batch: " + i + " start: " + start + " end: " + end);
            split(document, start, end);
        }
        // handling the remaining pages
        if (finalBatchSize > 0) {
            start = end + 1;
            end += finalBatchSize;
            System.out.println("Final Batch start: " + start + " end: " + end);
            split(document, start, end);
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // close the source document
        if (document != null) {
            try {
                document.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

private void split(PDDocument document, int start, int end) throws IOException {
    Splitter splitter = new Splitter();
    splitter.setStartPage(start);
    splitter.setEndPage(end);
    List<PDDocument> splittedDocuments = splitter.split(document);
    String outputPath = Config.INSTANCE.getProperty("outputPath");
    for (int index = 0; index < splittedDocuments.size(); index++) {
        String pdfFullPath = outputPath + document.getDocumentInformation().getTitle() + index + start + ".pdf";
        PDDocument splittedDocument = splittedDocuments.get(index);
        splittedDocument.save(pdfFullPath);
        // release each part as soon as it is written so the heap does not fill up
        splittedDocument.close();
    }
}

Huge result PDF from iText using PdfSmartCopy

I have (in most cases) 2 PDF files: 1 main PDF containing around 30,000 pages, and another one containing 1 page which I want to insert into the main one after every X pages (depending on my separate index file), while also adding barcodes to each page.
The problem I have is that the resulting PDF becomes VERY large (10 GB+), while the input file is 350 MB and the small one I want to insert is <50 KB.
What's a good way to optimize the size of the PDF I'm creating?
Here are the relevant parts of the code handling the PDF merge.
PdfImportedPage page;
PdfSmartCopy outPdf;
PdfSmartCopy.PageStamp stamp;
PdfReader pdfReader, pdfInsertReader;
...
outDoc = new Document();
outPdf = new PdfSmartCopy(outDoc, new FileOutputStream(outFile));
pdfToolset = new PDFToolset();
outDoc.open();
...
// loop over the pages in my index file
for (IndexPage indexpage : item.pages) {
    if (indexpage.insertPage) {
        currentDoc = indexpage.source_file;
        outPdf.freeReader(pdfReader);
        outPdf.flush();
        if (!currentDoc.equals(insertDoc)) {
            insertDoc = currentDoc;
            pdfInsertReader = new PdfReader(currentDoc);
        }
    } else if (!currentDoc.equals(indexpage.source_file)) {
        currentDoc = indexpage.source_file;
        outPdf.freeReader(pdfInsertReader);
        outPdf.flush();
        if (!mainDoc.equals(currentDoc)) {
            mainDoc = currentDoc;
            pdfReader = new PdfReader(mainDoc);
        }
    }
    if (indexpage.insertPage) {
        page = outPdf.getImportedPage(pdfInsertReader, indexpage.source_page);
    } else {
        page = outPdf.getImportedPage(pdfReader, indexpage.source_page);
    }
    if (!duplex || (duplex && indexpage.nr % 2 == 1)) {
        stamp = outPdf.createPageStamp(page);
        stamp = pdfToolset.applyBarcode(stamp, indexpage.omr, indexpage.nr);
        stamp.alterContents();
    }
    outPdf.addPage(page);
}
...

Splitting one PDF file into multiple files according to file size

I have been trying to split one big PDF file into multiple PDF files based on its size. I was able to split it, but it only creates one single file, and the rest of the file data is lost; that is, it does not create more than one file. Can anyone please help? Here is my code:
public static void main(String[] args) {
    try {
        PdfReader Split_PDF_By_Size = new PdfReader("C:\\Temp_Workspace\\TestZip\\input1.pdf");
        Document document = new Document();
        PdfCopy copy = new PdfCopy(document, new FileOutputStream("C:\\Temp_Workspace\\TestZip\\File1.pdf"));
        document.open();
        int number_of_pages = Split_PDF_By_Size.getNumberOfPages();
        int pagenumber = 1; /* To generate file names dynamically */
        // int Find_PDF_Size; /* To get PDF size in bytes */
        float combinedsize = 0; /* To convert this to kilobytes and estimate the new PDF size */
        for (int i = 1; i < number_of_pages; i++) {
            float Find_PDF_Size;
            if (combinedsize == 0 && i != 1) {
                document = new Document();
                pagenumber++;
                String FileName = "File" + pagenumber + ".pdf";
                copy = new PdfCopy(document, new FileOutputStream(FileName));
                document.open();
            }
            copy.addPage(copy.getImportedPage(Split_PDF_By_Size, i));
            Find_PDF_Size = copy.getCurrentDocumentSize();
            combinedsize = (float) Find_PDF_Size / 1024;
            if (combinedsize > 496 || i == number_of_pages) {
                document.close();
                combinedsize = 0;
            }
        }
        System.out.println("PDF Split By Size Completed. Number of Documents Created: " + pagenumber);
    } catch (Exception i) {
        System.out.println(i);
    }
}
(BTW, it would have been great if you had tagged your question with itext, too.)
PdfCopy used to close the PdfReaders it imported pages from whenever the source PdfReader for page imports switched or the PdfCopy was closed. This was due to the originally intended use case of creating one target PDF from multiple source PDFs, combined with the fact that many users forget to close their PdfReaders.
Thus, after you close the first target PdfCopy, the PdfReader is closed, too, and no further pages are extracted.
If I interpret the most recent checkins into the iText SVN repository correctly, this implicit closing of PdfReaders is in the process of being removed from the code. Therefore, with one of the next iText versions, your code may work as intended.
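In the meantime, one workaround (a sketch, not part of the original answer) is to give every output part its own PdfReader over the same source file, so the implicit close triggered by document.close() never affects pages that still need to be copied. The paths and the roughly 496 KB threshold mirror the question; the imports assume the old com.lowagie namespace of iText 2.1.7, so adjust them if you are on a newer iText.
import java.io.FileOutputStream;

import com.lowagie.text.Document;
import com.lowagie.text.pdf.PdfCopy;
import com.lowagie.text.pdf.PdfReader;

public class SplitBySizeWorkaround {

    public static void main(String[] args) throws Exception {
        String source = "C:\\Temp_Workspace\\TestZip\\input1.pdf";
        long maxBytesPerPart = 496 * 1024; // roughly the same ~496 KB threshold as the question

        // Count the pages once with a throwaway reader.
        PdfReader counter = new PdfReader(source);
        int totalPages = counter.getNumberOfPages();
        counter.close();

        int part = 0;
        int page = 1;
        while (page <= totalPages) {
            part++;
            // A fresh reader per output part: the implicit close triggered by
            // document.close() then only affects this part's reader.
            PdfReader reader = new PdfReader(source);
            Document document = new Document();
            PdfCopy copy = new PdfCopy(document,
                    new FileOutputStream("C:\\Temp_Workspace\\TestZip\\File" + part + ".pdf"));
            document.open();
            while (page <= totalPages) {
                copy.addPage(copy.getImportedPage(reader, page));
                page++;
                if (copy.getCurrentDocumentSize() > maxBytesPerPart) {
                    break; // start a new output part
                }
            }
            document.close(); // also closes this part's reader in older iText versions
        }
        System.out.println("PDF Split By Size Completed. Number of Documents Created: " + part);
    }
}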
