Caching of basic XWPFDocument templates and reusing them for document generation

We are implementing a portal that handles requests for modifying and generating Microsoft Office 2007 documents (docx). The back-end is implemented in Java and uses Apache POI as the API for manipulating the contents of the docx files. The back-end is accessed through REST API calls coming from a front-end written in JavaScript.
The back-end acts like a Document Server that handles about 15 different docx documents which act as templates and contain tokens that need to be replaced with actual values. The requests coming from the front-end are actually a token value map that the back-end needs to replace in the templates and generate a new document, for each request. The workflow is as follows:
receive request from front-end: token-value map
read template document as XWPFDocument object
parse and replace text in all XWPFParagraph/XWPFTable elements of the XWPFDocument
write the modified XWPFDocument to a different file path
I am trying to implement a caching mechanism at the moment, because going to the disk and reading the files for each request is a real performance issue. I would need to treat each template document as a Prototype and return a clone for each request that the back-end receives, something similar to this:
XWPFDocument theDocument = documentCache.clone(documentConfiguration.getInputType());
The clone method is currently implemented as follows:
public XWPFDocument clone(DocumentDictionary.DocumentType type) {
    if (PACKAGE_MAP.isEmpty())
        getPackages();
    XWPFDocument document = null;
    try {
        document = new XWPFDocument(PACKAGE_MAP.get(type));
    } catch (IOException exception) {
        logger.error("Unable to clone document for input type {}", type);
    }
    return document;
}
This implementation does not yield the desired results: the first request is processed as expected, but the second request fails when writing the document, with the error:
Caused by: org.xml.sax.SAXParseException: The processing instruction target matching "[xX][mM][lL]" is not allowed.
The exception above does not replicate in the case of reading the document fresh at each request.
Looking at the Apache POI API, the clone() methods of XWPFDocument and ZipPackage, which are used in the reading/writing process, are protected, so I cannot use the basic functionality offered by the programming language. The issue seems to come from the fact that the ZipPackage is shared and used for both reading and writing the document.
Has anyone been able to implement such a mechanism using Java and Apache POI?

You can pre-load the byte[] of each template type and then use it as the base for each clone, wrapped in a ByteArrayInputStream.
You can also load a template only when it is requested, so that getPackages() becomes getPackage(DocumentDictionary.DocumentType type) and checks for the single type instead of looping over all of them.
The XWPFDocument is modified at runtime: when you change its paragraphs, tables or runs, you are editing the template document itself, so you have to reload it in some other way for each request.
// PACKAGE_MAP now maps each document type to the raw bytes of its template file.
private void getPackages() throws IOException {
    for (DocumentDictionary.DocumentType type : DocumentDictionary.DocumentType.values()) {
        PACKAGE_MAP.put(type, FileUtils.readFileToByteArray(new File(getTemplateFromType(type))));
    }
}
private String getTemplateFromType(DocumentDictionary.DocumentType type) {
    switch(type) {
        case TYPE_1:
            return "/path/to/template/type_1.docx";
        ...
    }
}
public XWPFDocument clone(DocumentDictionary.DocumentType type) {
    XWPFDocument document = null;
    try {
        if (PACKAGE_MAP.isEmpty())
            getPackages();
        // A fresh stream over the cached bytes gives every request its own package.
        document = new XWPFDocument(new ByteArrayInputStream(PACKAGE_MAP.get(type)));
    } catch (IOException exception) {
        logger.error("Unable to clone document for input type {}", type, exception);
    }
    return document;
}
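The lazy variant mentioned above (loading a template only when it is first requested) could look roughly like the following. This is a minimal sketch using only the standard library, not the code from the portal: the class name TemplateByteCache, the String type key and the path lookup function are assumptions made for illustration. Every call returns a new stream over the same cached bytes; in the portal that stream would be handed to new XWPFDocument(InputStream).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical helper: caches template bytes per type and reads each file at most once.
class TemplateByteCache {
    private final Map<String, byte[]> bytesByType = new ConcurrentHashMap<>();
    private final Function<String, Path> pathForType; // assumed lookup from type to file

    TemplateByteCache(Function<String, Path> pathForType) {
        this.pathForType = pathForType;
    }

    // Returns an independent stream over the cached bytes; the disk is touched only on first use.
    InputStream open(String type) {
        byte[] bytes = bytesByType.computeIfAbsent(type, t -> {
            try {
                return Files.readAllBytes(pathForType.apply(t));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        return new ByteArrayInputStream(bytes);
    }

    // Number of templates loaded so far.
    int cachedTypes() {
        return bytesByType.size();
    }
}
```

Because the cached byte[] is never handed out directly, no request can modify the template that later requests are built from.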

Related

How to read file in compilation time using Java?

I have a project which consists of reading 1000 XML files, each defining a rule of processing the different types of data. The consequence is that the application takes a few seconds to load the XML files when it starts. It's an Android mobile app so the CPU isn't very powerful.
Is there a way to create static objects at compilation time by reading these XML files? If I can pre-process the XML by defining static objects which already have the XML read into it, the app should be able to start loaded, a lot faster. The draw-back that the XML file can't change in the runtime is acceptable.
This is a generic question - I am not bound to use any specific method or library. Anything that allows me to pre-parse the XML will do. But as comments asked for my current runtime-parsing implementation, I provide it in the following paragraphs which uses the DOM parser shipped with Java.
The current implementation:
The XML processing class simply creates an object by reading each XML file. It is used like this:
lst.add(new XMLData(new FileInputStream(new File("assets/001.xml"))));
lst.add(new XMLData(new FileInputStream(new File("assets/002.xml"))));
....
Where XMLData is the object that reads the XML file and keeps the relevant information. lst is a List of such objects.
The XMLData class looks like this:
class XMLData {
    public XMLData(InputStream xmlAsset) throws IOException, SAXException {
        DocumentBuilder dBuilder;
        try {
            DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
            dBuilder = dbFactory.newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            // TODO: if schema has problems (e.g. defined twice).
            // all XML well-formedness were checked before shipping them
            e.printStackTrace(); // shouldn't happen
            return;
        }
        Document xml = dBuilder.parse(xmlAsset);

Count pages/lines in a word file that was edited with docx4j

I found some posts here how to count pages/lines with the apache-poi library.
But my code already uses docx4j, and it would be too much work to replace that completely.
Therefore my question is: how can I get from an object of type WordprocessingMLPackage to an object of type XWPFDocument in order to count the lines and pages of my current document?
private XWPFDocument convertDocx4J(WordprocessingMLPackage wp) {
XWPFDocument oiDoc = null;
//TODO...
return oiDoc;
}

Easiest way to go from docx4j's WordprocessingMLPackage to POI would be to use docx4j's API to save as docx, then POI's to load.
But you can get page info from docx4j; see https://github.com/plutext/docx4j/blob/master/src/main/java/org/docx4j/toc/TocGenerator.java#L657
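The save-then-load route does not need a temporary file; an in-memory buffer is enough. The following is a minimal, library-neutral sketch: the class InMemoryRoundTrip and its two small interfaces are hypothetical names introduced here. With the real libraries, the call would presumably be convert(wp::save, XWPFDocument::new), assuming docx4j's save(OutputStream) method and POI's XWPFDocument(InputStream) constructor.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper: hands a document from one library to another through a byte buffer.
class InMemoryRoundTrip {
    interface Saver { void saveTo(OutputStream out) throws Exception; }
    interface Loader<T> { T loadFrom(InputStream in) throws Exception; }

    static <T> T convert(Saver saver, Loader<T> loader) throws Exception {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        saver.saveTo(buffer); // the source library serializes the document
        try (InputStream in = new ByteArrayInputStream(buffer.toByteArray())) {
            return loader.loadFrom(in); // the target library parses the same bytes
        }
    }
}
```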

getting exception while redacting pdf using itext

I am getting the exception below while trying to redact a PDF document using iText.
The issue is sporadic: sometimes it works and sometimes it throws the error.
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.access$6100(PdfContentStreamProcessor.java:60)
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor$Do.invoke(PdfContentStreamProcessor.java:991)
at com.itextpdf.text.pdf.pdfcleanup.PdfCleanUpContentOperator.invoke(PdfCleanUpContentOperator.java:140)
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.invokeOperator(PdfContentStreamProcessor.java:286)
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.processContent(PdfContentStreamProcessor.java:425)
at com.itextpdf.text.pdf.pdfcleanup.PdfCleanUpProcessor.cleanUpPage(PdfCleanUpProcessor.java:160)
at com.itextpdf.text.pdf.pdfcleanup.PdfCleanUpProcessor.cleanUp(PdfCleanUpProcessor.java:135)
at RedactionClass.tgestRedactJavishsInput(RedactionClass.java:56)
at RedactionClass.main(RedactionClass.java:23)
The code I am using to redact is below:
public static void testRedact() throws IOException, DocumentException {
    InputStream resource = new FileInputStream("D:/itext/edited_120192824_5 (1).pdf");
    OutputStream result = new FileOutputStream(new File(OUTPUTDIR, "aviteshs.pdf"));
    PdfReader reader = new PdfReader(resource);
    PdfStamper stamper = new PdfStamper(reader, result);
    int pageCount = reader.getNumberOfPages();
    Rectangle linkLocation1 = new Rectangle(440f, 700f, 470f, 710f);
    Rectangle linkLocation2 = new Rectangle(308f, 205f, 338f, 215f);
    Rectangle linkLocation3 = new Rectangle(90f, 155f, 130f, 165f);
    List<PdfCleanUpLocation> cleanUpLocations = new ArrayList<PdfCleanUpLocation>();
    for (int currentPage = 1; currentPage <= pageCount; currentPage++) {
        if (currentPage == 1) {
            cleanUpLocations.add(new PdfCleanUpLocation(currentPage, linkLocation1, BaseColor.BLACK));
            cleanUpLocations.add(new PdfCleanUpLocation(currentPage, linkLocation2, BaseColor.BLACK));
            cleanUpLocations.add(new PdfCleanUpLocation(currentPage, linkLocation3, BaseColor.BLACK));
        } else {
            cleanUpLocations.add(new PdfCleanUpLocation(currentPage, linkLocation1, BaseColor.BLACK));
        }
    }
    PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(cleanUpLocations, stamper);
    try {
        cleaner.cleanUp();
    } catch (Exception e) {
        e.printStackTrace();
    }
    stamper.close();
    reader.close();
}
As this is a customer document I am unable to share it; I am trying to find some test data that reproduces the issue.
Please find the doc here:
https://drive.google.com/file/d/0B-zalNTEeIOwM1JJVWctcW8ydU0/view?usp=drivesdk

In short: The cause of the NullPointerException here is that iText does not support form XObject resource inheritance from the page they are displayed on. According to the PDF specification this construct is obsolete but it can be encountered in PDFs obeying early PDF references instead of the specification.
The cause
Page 1 of the document in question contains 4 XObject resources named I1, M0, P1, and Q0:
As you can see in the screenshot, Q0 in particular has no own Resources dictionary. But its last instructions are
q
413 0 0 125 75 3086 cm
/I1 Do
Q
Id est it references a resource I1.
Now iText in case of form XObjects assumes that the resources their contents reference are contained in their own Resources dictionary.
The result: iText accesses a null dictionary and a NullPointerException occurs.
The specification
The PDF specification ISO 32000-1 specifies:
A resource dictionary shall be associated with a content stream in one of the following ways:
For a content stream that is the value of a page’s Contents entry (or is an element of an array that is the value of that entry), the resource dictionary shall be designated by the page dictionary’s Resources or is inherited, as described under 7.7.3.4, "Inheritance of Page Attributes," from some ancestor node of the page object.
For other content streams, a conforming writer shall include a Resources entry in the stream's dictionary specifying the resource dictionary which contains all the resources used by that content stream. This shall apply to content streams that define form XObjects, patterns, Type 3 fonts, and annotation.
PDF files written obeying earlier versions of PDF may have omitted the Resources entry in all form XObjects and Type 3 fonts used on a page. All resources that are referenced from those forms and fonts shall be inherited from the resource dictionary of the page on which they are used. This construct is obsolete and should not be used by conforming writers.
(ISO 32000-1, section 7.8.3 - Resource Dictionaries)
Thus, in the case at hand we are in the situation of the obsolete option three, Q0 references the XObject I1 defined in the resource dictionary of the page Q0 is used for.
The document in question has a version header claiming PDF 1.5 conformance (in contrast to PDF 1.7 of the PDF specification). So let's look at the PDF Reference 1.5. The paragraph there corresponding to option three is:
A form XObject or a Type 3 font’s glyph description may omit the Resources
entry, in which case resources will be looked up in the Resources entry of the
page on which the form or font is used. This practice is not recommended.
Summarized, therefore, the PDF in question uses a construct which the PDF specification (published in 2008, in use for nine years!) calls obsolete and even the PDF Reference the file claims conformance to recommends against. iText, on the other hand, does not support this obsolete construct.
Ideas how to fix this
Essentially the PDF Cleanup code must be extended to
remember the resources of the current page in the PdfCleanUpProcessor and
use these current page resources in the PdfCleanUpContentOperator method invoke in case of a Do operator referring to form XObject without own resources.
Unfortunately some members used in invoke are private. Thus, one has to either copy the PdfCleanUp code or fall back on reflection.
(iText 5.5.12-SNAPSHOT)
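The lookup rule behind this fix can be sketched independently of iText. The following is a minimal model that uses plain maps in place of PDF dictionaries; the class ResourceLookup and its method are hypothetical and are not part of the iText API. It only illustrates the fallback: a form XObject's own Resources entry wins, and the resources of the current page are used when that entry is missing.

```java
import java.util.Map;

// Hypothetical model: dictionaries are plain maps here, not iText's PdfDictionary.
class ResourceLookup {
    // Picks the resource dictionary for a form XObject's content stream:
    // its own Resources entry if present, otherwise the resources of the page it is drawn on.
    @SuppressWarnings("unchecked")
    static Map<String, Object> resourcesFor(Map<String, Object> formXObject,
                                            Map<String, Object> currentPageResources) {
        Object own = formXObject.get("Resources");
        return own != null ? (Map<String, Object>) own : currentPageResources;
    }
}
```

For a form like Q0, which has no Resources entry, the lookup of I1 would then go to the page resources instead of a null dictionary.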
iText 7
The iText 7 PDF CleanUp tool also runs into an issue with your PDF; here the exception is an IllegalStateException claiming "Graphics state is always deleted after event dispatching. If you want to preserve it in renderer info, use preserveGraphicsState method after receiving renderer info."
As this exception is thrown during event dispatching, this error message does not make sense. Unfortunately the PDF CleanUp tool has become closed source in iText 7, so it is not so easy to pinpoint the issue.
(iText 7.0.3-SNAPSHOT; PDF CleanUp 1.0.2-SNAPSHOT)

iText: PDF Generation. One Template. More Inputs. One Output

I am trying to generate a PDF with iText. First I read in an existing template and stamp the form data into it in the method stampFormular(Formular formular, PdfStamper stamper). The stamp method works, but I have a problem adding more forms to the output file.
I want to stamp the PDF template "yellow" once for each Formular. I tried document.add(), but that doesn't work, so I tried the PdfWriter, but that doesn't work either. Any idea how I can stamp the PDF template with the data of one Formular, start a new page, and stamp the same PDF template with the data of the next Formular?
public static File createForm(List<Formular> formulars) {
    Document document = new Document();
    File pdf = null;
    document.open();
    try {
        PdfReader pdfTemplate = new PdfReader("YELLOW");
        PdfStamper stamper = new PdfStamper(pdfTemplate,
                new FileOutputStream("output.pdf"));
        PdfWriter writer;
        for (Formular f : formulars) {
            stamper = stampFormular(f, stamper);
            writer = stamper.getWriter();
            writer.newPage();
        }
        stamper.close();
        pdfTemplate.close();
        pdf = new File("output.pdf");
        Desktop.getDesktop().open(pdf);
    } catch (DocumentException | IOException e) {
        e.printStackTrace();
    }
    return pdf;
}

A couple of observations:
You can't take the PdfWriter object from a PdfStamper, use newPage() and expect it to work. That's the equivalent of opening the hood of your car and starting to rewire tubes that fit without knowing anything about the art of motor maintenance. When you want to add a new page to a stamper, you're supposed to use the insertPage() method as explained in the documentation.
Second observation: you're not telling us if you're flattening the content of the forms. If you do, then it's simple, just use the example mentioned in the documentation and you're all set. In other words: combine PdfStamper with PdfSmartCopy. Especially if you're using the same template over and over again, PdfSmartCopy will give you much better results than PdfCopy for the reason explained in chapter 6.
Suppose that your template needs to remain interactive, then you may have a problem for a reason that is also explained in that chapter: different visualizations of a field with a specific name must always have the same value. For instance: if your template has a field named name, then every occurrence of this field in the document must have the same value. If you don't want this, you need to rename name, for instance to name1, name2, etc...
Concatenation of templates that need to remain interactive used to be done with PdfCopyFields (see documentation). Here, the documentation is somewhat outdated. In the latest version of iText, we now have a method addDocument() in PdfCopy and PdfSmartCopy. This method allows you to add a full document at once, preserving the interactivity.
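The renaming scheme for interactive copies (name becomes name1, name2, and so on) can be planned up front. The following is a minimal sketch that only computes the new names per copy; the class FieldRenamer is a hypothetical helper, and applying the renames to the actual form fields is left to the PDF library.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: computes unique field names for each stamped copy of a template.
class FieldRenamer {
    // Maps each original field name to a per-copy name, for example "name" to "name2" for copy 2.
    static Map<String, String> renamesForCopy(List<String> fieldNames, int copyNumber) {
        Map<String, String> renames = new LinkedHashMap<>();
        for (String field : fieldNames) {
            renames.put(field, field + copyNumber);
        }
        return renames;
    }
}
```

This simple suffix assumes that no original field name already ends in a digit; otherwise two different fields could end up with the same new name.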

Generating a PDF using a PUT operation

I have a web application that can display a generated PDF file to the user using the following Java code on the server:
@Path("MyDocument.pdf/")
@GET
@Produces({"application/pdf"})
public StreamingOutput getPDF() throws Exception {
    return new StreamingOutput() {
        public void write(OutputStream output) throws IOException, WebApplicationException {
            try {
                PdfGenerator generator = new PdfGenerator(getEntity());
                generator.generatePDF(output);
            } catch (Exception e) {
                logger.error("Error getting PDF file.", e);
                throw new WebApplicationException(e);
            }
        }
    };
}
This code takes advantage of the fact that I only need so much data from the front end in order to generate the PDF, so it can easily be done using a GET function.
However, I now want to return a PDF in a more dynamic way, and need a bunch more information from the front end in order to generate the PDF. In other areas, I'm sending similar amounts of data and persisting it to the data store using a PUT and @FormParam parameters, such as:
@PUT
@Consumes({"application/x-www-form-urlencoded"})
public void put(@FormParam("name") String name,
        @FormParam("details") String details,
        @FormParam("moreDetails") String moreDetails...
So, because of the amount of data I need to pass from the front end, I can't use a GET function with just query parameters.
I'm using Dojo on the front-end, and all of the dojo interactions really don't know what to do with a PDF returned from a PUT operation.
I'd like to not have to do this in two steps (persist the data sent in the PUT, and then request the PDF) simply because the PDF is more "transient" in this use case, and I don't want the data taking up space in my data store.
Is there a way to do this, or am I thinking about things all wrong?
Thanks.

I can't quite understand what you need to accomplish. It looks like you want to submit some data, persist it, and then return a PDF as the result? This should be straightforward and doesn't need to be two steps: on submit, save the data and return the PDF.
Is this your problem? Can you clarify?
P.S.
Ok, you need to do the following in your servlet:
response.setHeader("Content-disposition",
"attachment; filename=" +
"Example.pdf" );
response.setContentType( "application/pdf" );
Set the "content-length" on the response, otherwise the Acrobat Reader plugin may not work properly, e.g. response.setContentLength(bos.size());
If you provide output in JSP you can do this:
<%@ page contentType="application/pdf" %>
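The header advice above implies that the PDF is generated into a buffer first, so that its size is known before the response is written. The following is a minimal sketch of that step using only the standard library; the class PdfResponseHeaders is a hypothetical helper, and copying the values onto the actual response object is left to the web framework.

```java
import java.io.ByteArrayOutputStream;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: derives the response headers from a PDF that was generated into a buffer.
class PdfResponseHeaders {
    // Assumes a simple filename without spaces or quotes.
    static Map<String, String> forDownload(String filename, ByteArrayOutputStream pdf) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Content-Type", "application/pdf");
        headers.put("Content-Disposition", "attachment; filename=" + filename);
        headers.put("Content-Length", String.valueOf(pdf.size())); // known only after buffering
        return headers;
    }
}
```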
