I am using jPod to merge my PDF documents. I merged 400 PDFs of 20 pages each, resulting in a file of 190 MB, whereas a single PDF is 38 KB. I checked the heap status in my IDE and didn't get any OutOfMemoryError. When I ran the same code in Apache Tomcat with almost 30 clients, Tomcat stopped serving requests. Is it because jPod doesn't use streaming, or is it due to some other reason?
private void run() throws Throwable {
    String sOutFileFullPathAndName = "/Users/test/Downloads/" + UUID.randomUUID().toString().replace("-", "");
    PDDocument dstDocument = PDDocument.createNew();
    for (int i = 0; i < 400; i++) {
        //System.out.println(Runtime.getRuntime().freeMemory());
        PDDocument srcDocument = PDDocument.createFromLocator(new FileLocator("/Users/test/Downloads/2.pdf"));
        mergeDocuments(dstDocument, srcDocument);
    }
    FileLocator destinationLocator = new FileLocator(sOutFileFullPathAndName);
    dstDocument.save(destinationLocator, null);
    dstDocument.close();
}

private void mergeDocuments(PDDocument dstDocument, PDDocument srcDocument) {
    PDPageTree pageTree = srcDocument.getPageTree();
    int pageCount = pageTree.getCount();
    for (int index = 0; index < pageCount; index++) {
        PDPage srcPage = pageTree.getPageAt(index);
        appendPage(dstDocument, srcPage);
        srcPage = null;
    }
}

private void appendPage(PDDocument document, PDPage page) {
    PDResources srcResources = page.getResources();
    CSContent cSContent = page.getContentStream();
    PDPage newPage = (PDPage) PDPage.META.createNew();
    // copy resources from source page to the newly created page
    PDResources newResources = (PDResources) PDResources.META
            .createFromCos(srcResources.cosGetObject().copyDeep());
    newPage.setResources(newResources);
    newPage.setContentStream(cSContent);
    // add that new page to the destination document
    document.addPageNode(newPage);
}
A PDF is not simply a "stream" of page data. It is a complex data structure containing objects that reference each other - in this concrete case page trees/nodes, content streams, resources, ...
jPod keeps persistent objects in memory using weak references only - they can always be refreshed from the random access data. If you start updating the object structure, objects get "locked" in memory, simply because the change is not yet persistent and the objects can no longer be refreshed.
Making lots of changes without periodically saving the result will keep the complete structure in memory - I assume that's your problem here. Saving every now and then should reduce the memory footprint.
In addition, this algorithm will create a poor page tree, with thousands of pages in one linear array. You should try to create a balanced tree structure. Another point for optimization is resource handling: merging resources such as fonts or images may dramatically reduce the target file size.
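Following that advice, here is a minimal sketch of periodic saving applied to the run() loop from the question. It uses only the jPod calls already shown above; the batch size of 50 is an arbitrary assumption, and the key point to verify is whether an intermediate save to the destination FileLocator releases the "locked" objects:
private void run() throws Throwable {
    String sOutFileFullPathAndName = "/Users/test/Downloads/" + UUID.randomUUID().toString().replace("-", "");
    FileLocator destinationLocator = new FileLocator(sOutFileFullPathAndName);
    PDDocument dstDocument = PDDocument.createNew();
    for (int i = 0; i < 400; i++) {
        PDDocument srcDocument = PDDocument.createFromLocator(new FileLocator("/Users/test/Downloads/2.pdf"));
        mergeDocuments(dstDocument, srcDocument);
        // persist the accumulated changes every 50 merged files (arbitrary batch size)
        // so the copied objects no longer have to be kept pinned in memory
        if ((i + 1) % 50 == 0) {
            dstDocument.save(destinationLocator, null);
        }
    }
    dstDocument.save(destinationLocator, null);
    dstDocument.close();
}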
Related
I am getting a java.lang.OutOfMemoryError when I try to merge one 44k-page PDF. I am fetching all 44k pages from my DB in chunks and merging them into my main document. It processes fine up to about 9.5k pages and then starts throwing a heap space error.
public void getDocumentAsPdf(String docid) {
    PDDocument pdDocument = new PDDocument();
    try {
        // fetching total count from DB
        Long totalPages = countByDocument(docid);
        Integer batchSize = 400;
        Integer skip = 0;
        Long totalBatches = totalPages / batchSize;
        Long remainingPages = totalPages % batchSize;
        for (int i = 1; i <= totalBatches; i++) {
            log.info("Batch : {}", i);
            // fetching pages of the given document in ascending order from the database
            List<Page> documentPages = fetchPagesByDocument(docid, batchSize, skip);
            pdDocument = mergePagesToDocument(pdDocument, documentPages);
            skip += batchSize;
        }
        if (remainingPages > 0) {
            // fetching remaining pages of the given document in ascending order from the database
            List<Page> documentPages = fetchPagesByDocument(docid, batchSize, skip);
            pdDocument = mergePagesToDocument(pdDocument, documentPages);
        }
    }
    catch (Exception e) {
        throw new InternalErrorException("500", "Exception occurred while merging! ");
    }
}
Merge PDF logic:
public PDDocument mergePagesToDocument(PDDocument pdDocument, List<Page> documentPages) {
    try {
        PDFMergerUtility pdfMergerUtility = new PDFMergerUtility();
        pdfMergerUtility.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());
        for (Page page : documentPages) {
            byte[] decodedPage = java.util.Base64.getDecoder().decode(page.getPageData());
            PDDocument addPage = PDDocument.load(decodedPage);
            pdfMergerUtility.appendDocument(pdDocument, addPage);
            addPage.close();
        }
        return pdDocument;
    } catch (Exception e) {
        throw new InternalErrorException("500", e.getMessage());
    }
}
I think there is a memory leak on my side that is causing this issue. Any suggestion or better approach for this would be helpful. Thanks in advance!
It isn't exactly a memory leak, but you are trying to hold the whole 44k-page PDF in the pdDocument variable. That may well be bigger than your heap. You can increase the heap with the VM option -Xmx (read more here).
Alternatively, you can change your approach so that you don't keep all 44k pages in memory at once.
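As a sketch of the second approach (assuming PDFBox 2.x): write every decoded page to a temporary file, register it as a merge source, and let the merge buffer its working data on disk instead of the heap. The output path and the fetchAllPagesInBatches helper below are placeholders, not part of your code:
// uses org.apache.pdfbox.multipdf.PDFMergerUtility and org.apache.pdfbox.io.MemoryUsageSetting
public void getDocumentAsPdf(String docid) throws IOException {
    PDFMergerUtility merger = new PDFMergerUtility();
    merger.setDestinationFileName("/tmp/" + docid + ".pdf"); // placeholder output path

    List<File> tempFiles = new ArrayList<>();
    for (Page page : fetchAllPagesInBatches(docid)) { // placeholder for your batched DB fetch
        byte[] decodedPage = java.util.Base64.getDecoder().decode(page.getPageData());
        File tmp = File.createTempFile("page-", ".pdf");
        java.nio.file.Files.write(tmp.toPath(), decodedPage); // page bytes go to disk, not the heap
        merger.addSource(tmp);
        tempFiles.add(tmp);
    }

    // buffer intermediate objects in temp files instead of main memory during the merge
    merger.mergeDocuments(MemoryUsageSetting.setupTempFileOnly());

    for (File tmp : tempFiles) {
        tmp.delete();
    }
}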
I have an application that lets users get data from a database and download it as a CSV file.
The general workflow is as follows:
1. The user clicks the download button on the frontend.
2. The backend (Spring Boot in this case) starts an async thread to get the data from the database.
3. Generate CSV files with the data from step (2) and upload them to Google Cloud Storage.
4. Send the user an email with a signed URL to download the data.
My problem is that the backend keeps throwing an "OOM Java heap space" error in some extreme cases, where all of my memory (4 GB) gets filled. My initial plan was to load the data from the database via pagination (not all at once, to save memory) and generate a CSV for each page of data. That way the GC should clear the memory once a CSV has been generated, keeping the overall memory usage low. However, in practice the memory keeps increasing until it is all used up; the GC does not work as I expected. In the extreme case there are 18 pages in total, with around 200,000 records (from the DB) per page.
I used JProfiler to monitor heap usage and found that the retained size of those large byte[] objects is not 0, which suggests there are still references to them (I guess that's why the GC does not clear them from memory as expected).
How should I optimize my code and VM settings so that memory usage stays below 1 GB in the extreme case? What keeps those large byte[] objects from being collected by the GC?
The code to get the data from the database and generate the CSV files:
@Override
@Async
@Transactional(timeout = DOWNLOAD_DATA_TRANSACTION_TIME_LIMIT)
public void startDownloadDataInCSVBySearchQuery(SearchQuery query, DownloadRequestRecord downloadRecord) throws IOException {
    logger.debug(Thread.currentThread().getName() + ": starts to process download data");
    String username = downloadRecord.getUsername();
    // get posts from database first
    List<? extends SocialPost> posts = this.postsService.getPosts(query);
    try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
        // get ids of posts
        List<String> postsIDs = this.getPostsIDsFromPosts(posts);
        int postsSize = postsIDs.size();
        // do pagination db search. For each page, there are 1500 posts
        int numPages = postsSize / POSTS_COUNT_PER_PAGE + 1;
        for (int i = 0; i < numPages; i++) {
            logger.debug("Download comments: start at page {}, out of total page {}", i + 1, numPages);
            int pageStartPos = i * POSTS_COUNT_PER_PAGE; // POSTS_COUNT_PER_PAGE is set to 1500
            int pageEndPos = Math.min((i + 1) * POSTS_COUNT_PER_PAGE, postsSize);
            // get post ids per page
            List<String> postsIDsPerPage = postsIDs.subList(pageStartPos, pageEndPos);
            // use post ids to get the corresponding comments from db, via sql "IN"
            List<Comment> commentsPerPage = this.commentsService.getCommentsByPostsIDs(postsIDsPerPage);
            // generate csv file for the page data and upload to google cloud
            String commentsFileName = "comments-" + downloadRecord.getDownloadTime() + "-" + (i + 1) + ".csv";
            this.csvUtil.generateCommentsCsvFileStream(commentsPerPage, commentsFileName, out);
            this.googleCloudStorageInstance.uploadDownloadOutputStreamData(out.toByteArray(), commentsFileName);
        }
    } catch (Exception ex) {
        logger.error("Exception from downloading data: ", ex);
    }
}
The code to generate a CSV file:
// use Apache csv
public void generateCommentsCsvFileStream(List<Comment> comments, String filename, ByteArrayOutputStream out) throws IOException {
    CSVPrinter csvPrinter = new CSVPrinter(new OutputStreamWriter(out), CSVFormat.DEFAULT.withHeader(PostHeaders.class).withQuoteMode(QuoteMode.MINIMAL));
    for (Comment comment : comments) {
        List<Object> record = Arrays.asList(
            // write csv content
            comment.getPageId(),
            ...
        );
        csvPrinter.printRecord(record);
    }
    // close printer to release memory
    csvPrinter.flush();
    csvPrinter.close();
}
The code to upload a file to Google Cloud Storage:
public Blob uploadDownloadOutputStreamData(byte[] fileStream, String filename) {
    logger.debug("Upload file: '{}' to google cloud storage", filename);
    BlobId blobId = BlobId.of(this.DownloadDataBucketName, filename);
    BlobInfo blobInfo = BlobInfo.newBuilder(blobId).build();
    return this.cloudStorage.create(blobInfo, fileStream);
}
The heap usage keeps increasing as the page number increases. The G1 Old Gen heap usage is still very high after the system crash.
The G1 Eden space is almost empty; the big objects are allocated directly into the Old Gen.
Old Gen GC activity is low; most GC activity happens in the Eden space.
The heap walker shows that the retained size of those big byte[] objects is not 0.
You're using a single instance of ByteArrayOutputStream, which just writes to an in-memory byte array.
That looks like a mistake, because you seem to want to upload each page on its own, not the accumulated result so far (which includes ALL previous pages).
By the way, doing this is useless:
try (ByteArrayOutputStream out = new ByteArrayOutputStream())
ByteArrayOutputStream does not need to be closed, as it lives entirely in memory. Just remove that. And create a new instance for each page (inside the pages for loop) instead of re-using the same instance for all pages, and it might just work fine - see the sketch below.
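A sketch of that change, reusing only the methods and fields already shown in your question (the per-page stream variable pageOut is the only new name):
for (int i = 0; i < numPages; i++) {
    List<String> postsIDsPerPage = postsIDs.subList(i * POSTS_COUNT_PER_PAGE,
            Math.min((i + 1) * POSTS_COUNT_PER_PAGE, postsSize));
    List<Comment> commentsPerPage = this.commentsService.getCommentsByPostsIDs(postsIDsPerPage);
    String commentsFileName = "comments-" + downloadRecord.getDownloadTime() + "-" + (i + 1) + ".csv";

    // a fresh stream per page: once this iteration ends, nothing keeps the page's
    // bytes reachable, so the GC can reclaim them
    ByteArrayOutputStream pageOut = new ByteArrayOutputStream();
    this.csvUtil.generateCommentsCsvFileStream(commentsPerPage, commentsFileName, pageOut);
    this.googleCloudStorageInstance.uploadDownloadOutputStreamData(pageOut.toByteArray(), commentsFileName);
}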
EDIT
Another piece of advice would be to break this code up into more methods... not just because it's more readable with smaller methods, but because you're keeping temporary variables in scope for too long (causing memory to stick around longer than necessary).
For example:
List<? extends SocialPost> posts = this.postsService.getPosts(query);
try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
    // get ids of posts
    List<String> postsIDs = this.getPostsIDsFromPosts(posts);
    ....
From this point on, posts is not used anymore, and I assume it contains a lot of stuff... so you should "drop" that variable once you've got the IDs.
Do something like this instead:
List<String> postsIDs = getAllPostIds(query);
....

List<String> getAllPostIds(SearchQuery query) {
    // this variable will be GC'd after this method returns, as it's no longer referenced
    // (assuming getPostsIDsFromPosts() doesn't store it in a field)
    List<? extends SocialPost> posts = this.postsService.getPosts(query);
    return this.getPostsIDsFromPosts(posts);
}
I'm using iText (v 2.1.7) and I need to find the size, in bytes, of a specific page.
I've written the following code:
public static long[] getPageSizes(byte[] input) throws IOException {
    PdfReader reader = new PdfReader(input);
    int pageCount = reader.getNumberOfPages();
    long[] pageSizes = new long[pageCount];
    for (int i = 0; i < pageCount; i++) {
        pageSizes[i] = reader.getPageContent(i + 1).length;
    }
    reader.close();
    return pageSizes;
}
But it doesn't work properly. The reader.getPageContent(i+1).length instruction returns very small values (usually <= 100), even for large pages of more than 1 MB, so clearly this is not the correct way to do it.
But what IS the correct way? Is there one?
Note: I've already checked this question, but the offered solution consists of writing each page of the PDF to disk and then checking the file size, which is extremely inefficient and may even be wrong, since I'm assuming this would repeat the PDF header and metadata each time. I was searching for a more "proper" solution.
Well, in the end I managed to get hold of the source code for the original program I was working with, which only accepted PDFs as input with a maximum "page size" of 1 MB. Turns out... what it actually meant by "page size" was fileSize / pageCount -_-
For anyone who actually needs the precise size of a "standalone" page, with all content included, I've tested the solution below and it seems to work well, though it probably isn't very efficient as it writes out an entire PDF document for each page. Using a memory stream instead of a disk-based one helps, but I don't know by how much.
public static int[] getPageSizes(byte[] input) throws IOException {
    PdfReader reader = new PdfReader(input);
    int pageCount = reader.getNumberOfPages();
    int[] pageSizes = new int[pageCount];
    for (int i = 0; i < pageCount; i++) {
        try {
            Document doc = new Document();
            ByteArrayOutputStream bous = new ByteArrayOutputStream();
            PdfCopy copy = new PdfCopy(doc, bous);
            doc.open();
            PdfImportedPage page = copy.getImportedPage(reader, i + 1);
            copy.addPage(page);
            doc.close();
            pageSizes[i] = bous.size();
        } catch (DocumentException e) {
            e.printStackTrace();
        }
    }
    reader.close();
    return pageSizes;
}
We are attempting to generate documents using iText that are formed largely from "template" files - smaller PDF files that are combined into one composite file using the PdfContentByte.addTemplate method. We then automatically and silently print the new file using the *nix command lp. This usually works; occasionally, however, a generated file fails to print. The document proceeds through all queues and arrives at the printer proper (a Lexmark T652n, in this case), its physical display shows a message of pending progress, and even its mechanical components whir up in preparation - then the print job vanishes spontaneously without a trace, and the printer returns to being ready.
The oddity is in how specific this issue tends to be. For starters, the files in question print without fail when printed manually through Adobe PDF Viewer, and can be read fine by editors like Adobe LiveCycle. Furthermore, the content of the file affects whether it is plagued by this issue, but not in a clear way - adding a specific template 20 times could cause the problem, while doing it 19 or 21 times might be fine, and using a different template changes the pattern entirely, perhaps causing the failure after 37 repetitions instead. Generating a document with the exact same content is consistent in whether or not the issue occurs, but any subtle and seemingly irrelevant change in content changes whether the problem happens.
While it could be considered a hardware issue, the fact remains that certain iText-generated files have this issue while others do not. Is our method of file creation sometimes producing files that are somehow considered corrupt only by the printer, and only sometimes?
Here is a relatively small code example that generates documents using the repetitive template method, similar to our main program. It uses this file as a template and repeats it a specified number of times.
public class PDFFileMaker {

    private static final int INCH = 72;
    private static final float MARGIN_TOP = INCH / 4;
    private static final float MARGIN_BOTTOM = INCH / 2;
    private static final String DIREC = "/pdftest/";
    private static final String OUTPUT_FILEPATH = DIREC + "cooldoc_%d.pdf";
    private static final String TEMPLATE1_FILEPATH = DIREC + "template1.pdf";
    private static final Rectangle PAGE_SIZE = PageSize.LETTER;
    private static final Rectangle TEMPLATE_SIZE = PageSize.LETTER;

    private ByteArrayOutputStream workingBuffer;
    private ByteArrayOutputStream storageBuffer;
    private ByteArrayOutputStream templateBuffer;
    private float currPosition;
    private int currPage;
    private int formFillCount;
    private int templateTotal;

    private static final int DEFAULT_NUMBER_OF_TIMES = 23;

    public static void main(String[] args) {
        System.out.println("Starting...");
        PDFFileMaker maker = new PDFFileMaker();
        File file = null;
        try {
            file = maker.createPDF(DEFAULT_NUMBER_OF_TIMES);
        }
        catch (Exception e) {
            e.printStackTrace();
        }
        if (file == null || !file.exists()) {
            System.out.println("File failed to be created.");
        }
        else {
            System.out.println("File creation successful.");
        }
    }

    public File createPDF(int inCount) throws Exception {
        templateTotal = inCount;
        String sFilepath = String.format(OUTPUT_FILEPATH, templateTotal);
        workingBuffer = new ByteArrayOutputStream();
        storageBuffer = new ByteArrayOutputStream();
        templateBuffer = new ByteArrayOutputStream();
        startPDF();
        doMainSegment();
        finishPDF(sFilepath);
        return new File(sFilepath);
    }

    private void startPDF() throws DocumentException, FileNotFoundException {
        Document d = new Document(PAGE_SIZE);
        PdfWriter w = PdfWriter.getInstance(d, workingBuffer);
        d.open();
        d.add(new Paragraph(" "));
        d.close();
        w.close();
        currPosition = 0;
        currPage = 1;
        formFillCount = 1;
    }

    protected void finishPDF(String sFilepath) throws DocumentException, IOException {
        // Transfers data from buffer 1 to the builder file
        PdfReader r = new PdfReader(workingBuffer.toByteArray());
        PdfStamper s = new PdfStamper(r, new FileOutputStream(sFilepath));
        s.setFullCompression();
        r.close();
        s.close();
    }

    private void doMainSegment() throws FileNotFoundException, IOException, DocumentException {
        File fTemplate1 = new File(TEMPLATE1_FILEPATH);
        for (int i = 0; i < templateTotal; i++) {
            doTemplate(fTemplate1);
        }
    }

    private void doTemplate(File f) throws FileNotFoundException, IOException, DocumentException {
        PdfReader reader = new PdfReader(new FileInputStream(f));
        // Transfers data from the template input file to a temporary buffer
        PdfStamper stamper = new PdfStamper(reader, templateBuffer);
        stamper.setFormFlattening(true);
        AcroFields form = stamper.getAcroFields();
        // Get size of the template by looking for the "end" AcroField
        float[] area = form.getFieldPositions("end");
        float size = TEMPLATE_SIZE.getHeight() - MARGIN_TOP - area[4];
        // Requires a page break
        if (size >= PAGE_SIZE.getHeight() - MARGIN_TOP - MARGIN_BOTTOM + currPosition) {
            PdfReader subreader = new PdfReader(workingBuffer.toByteArray());
            PdfStamper substamper = new PdfStamper(subreader, storageBuffer);
            currPosition = 0;
            currPage++;
            substamper.insertPage(currPage, PAGE_SIZE);
            substamper.close();
            subreader.close();
            workingBuffer = storageBuffer;
            storageBuffer = new ByteArrayOutputStream();
        }
        // Set fields
        form.setField("field1", String.format("Form Text %d", formFillCount));
        form.setField("page", String.format("Page %d", currPage));
        formFillCount++;
        stamper.close();
        reader.close();
        // Read from working buffer, stamp to storage buffer, stamp template from template buffer
        reader = new PdfReader(workingBuffer.toByteArray());
        stamper = new PdfStamper(reader, storageBuffer);
        reader.close();
        reader = new PdfReader(templateBuffer.toByteArray());
        PdfImportedPage page = stamper.getImportedPage(reader, 1);
        PdfContentByte cb = stamper.getOverContent(currPage);
        cb.addTemplate(page, 0, currPosition);
        stamper.close();
        reader.close();
        // Reset buffers - working buffer takes on storage buffer data, storage and template buffers clear
        workingBuffer = storageBuffer;
        storageBuffer = new ByteArrayOutputStream();
        templateBuffer = new ByteArrayOutputStream();
        currPosition -= size;
    }
}
Running this program with DEFAULT_NUMBER_OF_TIMES set to 23 produces this document, which causes the failure when sent to the printer. Changing it to 22 produces this similar-looking document (simply with one less "line"), which does not have the problem and prints successfully. Using a different PDF file as the template component completely changes these numbers, or the failure may not happen at all.
While this problem is likely too specific, with too many factors, for other people to reasonably reproduce, the question of possibilities remains. What about the file generation could cause this unusual behavior? What might make one file acceptable to a specific printer but another, generated in the same manner and differing only in seemingly trivial ways, unacceptable? Is there a bug in iText triggered by using the stamper template commands too heavily? This has been a long-running bug for us, so any assistance is appreciated; additionally, I am willing to answer questions or have extended conversations in chat as necessary in an effort to get to the bottom of this.
The design of your application more or less abuses the otherwise perfectly fine PdfStamper functionality.
Allow me to explain.
The contents of a page can be expressed as a stream object or as an array of stream objects. When you change a page using PdfStamper, the content of that page is always an array of stream objects, consisting of the original stream object (or the original array of stream objects) to which extra elements are added.
By adding the same template through a newly created PdfStamper over and over again, you increase the number of elements in the page contents array dramatically. You also introduce a huge number of q and Q operators that save and restore the graphics state. The reason why you see random behavior is clear: the memory and CPU available to process the PDF can vary from one moment to the next. One time there will be sufficient resources to deal with 20 q operators (which save the state); the next time there will only be sufficient resources to deal with 19. The problem occurs when the process runs out of resources.
While the PDFs you're creating aren't illegal according to ISO-32000-1, some PDF processors simply choke on these PDFs. iText is a toolbox that allows you to create PDFs that can make me very happy when I look under the hood, but it also allows you to create horrible PDFs if you don't use the toolbox wisely. The latter is what happened in your case.
You should solve this by reusing the PdfStamper instance instead of creating a new PdfStamper over and over again. If that's not possible, please post another question, using fewer words, explaining exactly what you want to achieve.
Suppose that you have many different source files with PDF snippets that need to be added to a single page. For instance: suppose that each PDF snippet is a coupon and you need to create a sheet with 30 coupons. Then you'd use a single PdfWriter instance, import pages with getImportedPage(), and add them at the correct position using addTemplate().
Of course, I have no idea what your project is about; the idea of coupons on a page was inspired by your test PDF.
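For illustration only, here is a minimal sketch of that coupon-sheet idea with a single PdfWriter (iText 2.1.7 classes from com.lowagie.text and com.lowagie.text.pdf; the file names, the 30-coupon count and the 2-column layout are assumptions, and exception handling is omitted):
// lay imported PDF snippets out on sheets using one PdfWriter and one imported page object
Document doc = new Document(PageSize.LETTER);
PdfWriter writer = PdfWriter.getInstance(doc, new FileOutputStream("/pdftest/coupons.pdf")); // assumed output path
doc.open();
PdfContentByte cb = writer.getDirectContent();
PdfReader coupon = new PdfReader("/pdftest/coupon.pdf"); // assumed snippet file
PdfImportedPage snippet = writer.getImportedPage(coupon, 1);
float snippetHeight = coupon.getPageSize(1).getHeight();
for (int i = 0; i < 30; i++) {
    if (i > 0 && i % 10 == 0) {
        doc.newPage(); // start a new sheet after 10 coupons (2 columns x 5 rows)
    }
    int column = i % 2;
    int row = (i / 2) % 5;
    cb.addTemplate(snippet, column * PageSize.LETTER.getWidth() / 2,
            PageSize.LETTER.getHeight() - (row + 1) * snippetHeight);
}
doc.close();
coupon.close();
Because the snippet is imported once and placed many times through the same writer, its resources end up in the output file only once, instead of being duplicated on every pass as in the stamper-in-a-loop approach.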
I am trying to find a SIFT implementation for the LIRE library. The only thing I found is the Extractor feature linked above. I am trying to understand what I have to use in order to extract SIFT features for an image.
Any idea what I have to do here?
I am trying something like:
Extractor e = new Extractor();
File img = new File("im.jpg");
BufferedImage in = ImageIO.read(img);
BufferedImage newImage = new BufferedImage(in.getWidth(),
in.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
List<Feature> fs1 = e.computeSiftFeatures(newImage);
System.out.println(fs1);
But I get an empty list.
// Here is the revised answer for you; it may help.
public class indexing {

    String directory = "your_image_dataset";
    String index = "./images__idex"; // where you will put the index

    /* if you want to use BOVW-based searching you can change the
       numbers below, but be careful */
    int numClusters = 2000;        // number of visual words
    int numDocForVocabulary = 200; // number of samples used for building the visual-word vocabulary

    /* this function calls the document builder and the indexer function (indexFiles below)
       for each image in the data set */
    public void IndexImage() throws IOException {
        System.out.println("-< Getting files to index >--------------");
        List<String> images = FileUtils.getAllImages(new File(directory), true);
        System.out.println("-< Indexing " + images.size() + " files >--------------");
        indexFiles(images, index);
    }

    /* this function builds a Lucene document for each image passed to it,
       holding the extracted visual descriptors */
    private void indexFiles(List<String> images, String index)
            throws FileNotFoundException, IOException {
        // first the high-level structure
        ChainedDocumentBuilder documentBuilder = new ChainedDocumentBuilder();
        // type of document to be created; here I included different types of visual features
        //documentBuilder.addBuilder(new SurfDocumentBuilder());
        // here choose either SURF or SIFT
        documentBuilder.addBuilder(new SiftDocumentBuilder());
        documentBuilder.addBuilder(DocumentBuilderFactory.getEdgeHistogramBuilder());
        documentBuilder.addBuilder(DocumentBuilderFactory.getJCDDocumentBuilder());
        documentBuilder.addBuilder(DocumentBuilderFactory.getColorLayoutBuilder());
        // IndexWriter creates the file for index storage
        IndexWriter iw = LuceneUtils.createIndexWriter(index, true);
        int count = 0;
        /* then each image in the data set is run through the document structure defined above
           (documentBuilder) and added to the index file */
        for (String identifier : images) {
            Document doc = documentBuilder.createDocument(new FileInputStream(identifier), identifier);
            iw.addDocument(doc); // adding the document to the index
            count++;
        }
        iw.close(); // closing the index writer once all images have been indexed
        /* For searching you will read the index and, by constructing an instance of
           IndexReader, you can use the different searching strategies available in Lire. */
    }
}