iText java: streaming of a concatenated pdf - java

So I am generating a 500+ page document by concatenating 250 individual pdfs of 2 pages each. Each pdf is a form which has fields that data is placed into.
I would like to stream content to the browser two pages at a time.
Currently I am waiting for the entire document to be generated before sending anything to the browser and this is causing some timeout problems.
This is a simplified form of what I am doing:
PdfCopyFields pdfCopier = new PdfCopyFields(outputStream);
for(Receipt receipt: receipts) {
ByteArrayOutputStream baos = new ByteArrayOutputStream();
generator.generateSaleDocument(receipt.getId(), baos);
pdfCopier.addDocument(new PdfReader(baos.toByteArray()));
}
pdfCopier.close();
outputStream.flush();
How do I stream these PDFs instead of sending them as one big bunch?

Related

GSON / iText: Extract Text From PDF 1.7 byte[]

I'm automating tests using Rest-Assured and GSON - and need to validate the contents of a PDF file that is returned in the response of a POST request. The content of the files vary and can contain anything from just text, to text and tables, or text and tables and graphics. Every page can, and most likely will be different as far a glyph content. I am only concerned with ALL text on the pdf page - be it just plain text, or text inside of a table, or text associated with (or is inside of) an image. Since all pdf's returned by the request are different, I cannot define search areas (as far as I know). I just need to extract all text on the page.
I extract the pdf data into a byte array like so:
Gson pdfGson = new Gson();
byte[] pdfBytes =
pdfGson.fromJson(this.response.as(JsonObject.class)
.get("pdfData").getAsJsonObject().get("data").getAsJsonArray(), byte[].class);
(I've tried other extraction methods for the byte[], but this is the only way I've found that returns valid data.) This returns a very large byte[] like so:
[37, 91, 22, 45, 23, ...]
When I parse the array I run into the same issue as This Question (except my pdf is 1.7) and I attempt to implement the accepted answer, adjusted for my purposes and as explained in the documentation for iText:
byte[] decodedPdfBytes = PdfReader.decodeBytes(pdfBytes, new PdfDictionary(), FilterHandlers.getDefaultFilterHandlers());
IRandomAccessSource source = new RandomAccessSourceFactory().createSource(decodedPdfBytes);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
ReaderProperties readerProperties = new ReaderProperties();
// Ineffective:
readerProperties.setPassword(user.password.getBytes());
PdfReader pdfReader = new PdfReader(source, readerProperties);
// Ineffective:
pdfReader.setUnethicalReading(true);
PdfDocument pdfDoc = new PdfDocument(pdfReader, new PdfWriter(baos));
for(int i = 1; i < pdfDoc.getNumberOfPages(); i++) {
String text = PdfTextExtractor.getTextFromPage(pdfDoc.getPage(i));
System.out.println(text);
}
This DOES decode the pdf page, and return text, but it is only the header text. No other text is returned.
For what it's worth, on the front end, when the user clicks the button to generate the pdf, it returns a blob containing the download data, so I'm relatively sure that the metadata is GSA encoded, but I'm not sure if that matters at all. I'm not able to share an example of the pdf docs due to sensitive material.
Any point in the right direction would be greatly appreciated! I've spent 3 days trying to find a solution.
For those looking for a solution - ultimately we wound up going a different route. We never found a solution to this specific issue.

PdfConcatenate (smart) in iText 7

In iText v5 there was the option to make 'smart' concatenation of PDF documents:
public PdfConcatenate(OutputStream os, boolean smart) throws DocumentException
Creates an instance of the concatenation class.
Parameters:
os - the OutputStream for the PDF document
smart - do we want PdfCopy to detect redundant content?
The initialization I was doing would be something like:
PdfConcatenate concatenatedPdf = new PdfConcatenate(outputStream, true);
In iText 7 I read we should use the copyPages function. Something like:
[...]
PdfDocument concatenatedPdf = new PdfDocument(writer);
PdfDocument docToAdd = new PdfDocument(pdfReader);
docToAdd.copyPagesTo(1, docToAdd.getNumberOfPages(), concatenatedPdf);
I'm migrating a logic to merge documents from iText v5 to v7. For a sample test in v5 with PdfConcatenate and the flag 'smart' the result PDF is 177 KB, in v7 is 763 KB. Is there a way to detect this redundant content in iText v7?
First of all, iText7 provides a convenient class called PdfMerger for merging PDFs.
Here is a sample how to use it:
PdfDocument sourceDocument = new PdfDocument(new PdfReader(filename));
PdfMerger resultDocument = new PdfMerger(new PdfDocument(new PdfWriter(resultFile)));
resultDocument.merge(sourceDocument, fromPage, toPage);
resultDocument.close();
sourceDocument.close();
Of course in the example only one set of pages from source document is added to the resultant document but you can call merge function as many times as you like.
Now when you want the resultant file to be as small in terms of file size as possible, you need to specify some settings for the destination PdfDocument that you feed to the PdfMerger.
First of all, you can tweak the compression level for streams to use more CPU and time but compress better:
PdfMerger resultDocument = new PdfMerger(new PdfDocument(
new PdfWriter(resultFile, new WriterProperties().setCompressionLevel(CompressionConstants.BEST_COMPRESSION))));
To compress even better you can use full compression. That would not only better compress streams (content of the page, images, fonts), but also would compress PDF objects that usually consume many bits in the out file size. This can be done like this:
PdfMerger resultDocument = new PdfMerger(new PdfDocument(
new PdfWriter(resultFile, new WriterProperties().setFullCompressionMode(true))));
In case source documents have same objects by default you might have some duplication. So-called "Smart Mode" provides a possibility to avoid such duplication and optimize the file size for cases when there are many duplicating objects. This would be the closes analogue to the "smart" flag you refer to in your iText 5 code. You can enable smart mode in iText 7 in the following way:
PdfMerger resultDocument = new PdfMerger(new PdfDocument(
new PdfWriter(resultFile, new WriterProperties().useSmartMode())));

How to create an input stream from a list of byte[] where each byte[] represent one page of a document (say pdf)

From my Java code I am calling a Document Service to get a document from it based on some input parameters. The mime type of the document could be pdf, text or image. The service response is in the form of List<PageContent> where each PageContent has byte[] which represent one page of the document. Now I want to create an input stream for the whole document.
So I thought of collating all the pages like following:
List<PageContent> pages = ....//Response from the service
ByteArrayOutputStream bos = new ByteArrayOutputStream();
for(PageContent page : pages) {
byte[] data = page.getData();
bos.write(data);
}
InputStream is = new ByteArrayInputStream(bos.toByteArray());
I still have doubt that it is the right way or not. Should the logic of collating all page to form a single input stream be different based on mime type?
What kind of details do I need to get from Document Service provider which will help me do this?

how sohronit page pdf file in byte byte [] and restore back

I need to parse a PDF file through the pages and load each separately into a byte[]. I use the itext library.
I download a file consisting of one page with this code:
public Document addPageInTheDocument(String namePage, MultipartFile pdfData, Long documentId) throws IOException {
notNull(namePage, INVALID_PARAMETRE);
notNull(pdfData, INVALID_PARAMETRE);
notNull(documentId, INVALID_PARAMETRE);
byte[] in = pdfData.getBytes(); // size file 88747
Page page = new Page(namePage);
Document document = new Document();
document.setId(documentId);
PdfReader reader = new PdfReader(new ByteArrayInputStream(pdfData.getBytes()));
PdfDocument pdfDocument = new PdfDocument(reader);
if (pdfDocument.getNumberOfPages() != 1) {
throw new IllegalArgumentException();
}
byte[] transform = pdfDocument.getPage(1).getContentBytes(); // 1907 size page
page.setPageData(pdfDocument.getPage(1).getContentBytes());
return addPageInTheDocument(document, page);
}
I'm trying to restore the file with this code:
ByteBuffer byteContent = new ByteBuffer() ;
for (Map.Entry<String, Page> page : pages.entrySet()) {
byteContent.append(page.getValue().getPageData());
}
PdfWriter writer = new PdfWriter(new FileOutputStream(book.getName() + modification + FORMAT));
byte[] df = byteContent.toByteArray();
PdfReader reader = new PdfReader(new ByteArrayInputStream(byteContent.toByteArray()));
com.itextpdf.layout.Document itextDocument = new com.itextpdf.layout.Document(new PdfDocument(reader, writer));
itextDocument.close();
Why is there such a difference in size?
And why the files and pages, and both the byte[] to create the file?
Let's start with your size question:
byte[] in = pdfData.getBytes(); // size file 88747
...
byte[] transform = pdfDocument.getPage(1).getContentBytes(); // 1907 size page
...
Why are there such a difference in size?
Because PdfPage.getContentBytes() does not return what you expect.
You seem to expect it to return a complete representation of the contents of the given page, and the Javadocs of that method might be interpreted ("Get decoded bytes for the whole page content.") to mean that.
This is not the case. PdfPage.getContentBytes() returns the contents of the content stream(s) of the page. These content stream(s) contain a sequence of commands which build the page. But these commands take parameters which reference data outside the content stream, e.g.:
when text is drawn on a PDF page, the content stream contains an operation selecting a font but the data describing the font and in case of embedded fonts the font program itself are outside the content stream;
when bitmap images are drawn, the content stream usually contains an operation for it which references image data outside the content stream;
there are operations which reference so called xobjects which essentially are independent content streams which can be called upon from any page; these xobject are not contained in the page content stream either.
Furthermore there are annotations (e.g. form fields) with their own content streams which are stored in separate structures. And lots of page properties are outside, too.
Thus, there are such differences in size because you get only a minute part of the page definition using getContentBytes.
Now let's look at your code "restoring the file".
As a corollary of the above it is obvious that your code merely concatenates some content streams but does not provide the external resources these streams refer to.
But aside from that your code also points out a misunderstanding concerning the nature of PDF pages: they are not merely blobs you can split and concatenate again as you want. They are collections of PDF objects which are spread throughout the PDF file; different pages can share some of their objects (e.g. fonts of often used images).
What you can do instead...
As representations of a single page you should use a PDF containing the data referenced by that single page. The iText example Burst.java shows how to do that.
To join these single page PDFs again you can use an iText PdfMerger. Remember to set smart mode (PdfWriter.setSmartMode(true)) to prevent resource duplication in the result.

Add HTML to iText document in memory using XHTMLrenderer (FlyingSaucer)

I am using iText 2.1.7 to generate a document from a database. One of the fields I need to add is in XHTML format. I can use the HTMLWorker class to generate the HTML but this is a bit limited.
I convert this to XHTML using the following code:
String url = chapterDesc.getString("description").toString(); // get the HTML string from the database
org.w3c.dom.Document doc = XMLResource.load(new ByteArrayInputStream(url.getBytes())).getDocument();
ITextRenderer renderer = new ITextRenderer();
renderer.setDocument(doc, null);
ByteArrayOutputStream os = new ByteArrayOutputStream();
renderer.layout();
renderer.createPDF(os);
I want to add this information to the document in memory. Is this possible?
Do I need to use PdfStamper? I believe that this requires the document to be closed? If it is possible I would like to avoid using multiple passes to add these descriptions.
Flying saucer does not work correctly with any version of iText other than 2.0.8. Also since you meantioned creating the pdf in memory are you using JSF, JSP, or servlets? If you are than you can just send your ByteArrayOutputStream as a response on one of these pages using something along the lines of
response.setContentType("application/pdf");
response.setContentLength(os.size());
os.writeTo(response.getOutputStream());
response.flushBuffer();
I know it's been more than two years since you've asked, but I'm facing the same problem. I googled for a solution and apparently there is none anywhere to be found. So I had to develop my own and I thought I might as well share it. Hope it'll be useful to someone.
I tried to use flying saucer as you did, but it didn't work for me. My piece of HTML was just a simple table so I could use iText HTMLWorker to do the parsing.
So first I get a PdfStamper as you suggested.
PdfReader template = new PdfReader(templateFileName);
PdfStamper editablePage = new PdfStamper(template, reportOutStream);
Then I work with the document (fill the fields, insert some images) and after that I need to insert an HTML snippet.
//getting a 'canvas' to add parsed elements
final ColumnText page = new ColumnText(editablePage.getOverContent(pageNumber));
//finding out the page sizefinal
Rectangle pagesize = editablePage.getReader().getPageSize(pageNumber);
//you can define any size here, that will be where your parsed elements will be added
page.setSimpleColumn(0, 0, pagesize.getWidth(), pagesize.getHeight());
If you need simple styling, HTMLWorker can do some
StyleSheet styles = new StyleSheet();
styles.loadStyle("h1", "color", "#008080");
//parsing
List<Element> parsedTags = HTMLWorker.parseToList(new StringReader(htmlSnippet), styles);
for (Element tag : parsedTags)
{
page.addElement(tag);
page.go();
}
These are just some basic ideas of how to do that, hope it helps.

Categories