PdfBox flatten pdf does not remove acroform elements - java

I have a PDF with many AcroForm fields. I do some manipulation on it, which results in a new PDF.
So I have PDF-1 (the original) and PDF-2 (just a duplicate of PDF-1), and now I want to merge them. Both PDFs have some AcroForm fields, for example: field_a, field_2...
Before merging I flatten PDF-1, because I only want to keep the fields from PDF-2. When I check the merged PDF, there are no visible fields on the pages from PDF-1 and there are fields on the pages from PDF-2. At first glance this seems fine, but when I inspect the fields I can see that the merger has renamed all the fields of PDF-2, e.g. field_a_dummy123, field_b_dummy232 ...
It seems to me that flattening does not remove the fields, and that is why the PDFMergerUtility from PDFBox renames the fields of PDF-2, since AcroForm field names need to be unique. Is there a way to completely remove the AcroForm fields of PDF-1?
@Test
public void flattenAndMerge() throws IOException {
File testForm = new File(classLoader.getResource("./TestForm.pdf").getFile());
byte[] testFormAsByte = Files.readAllBytes(testForm.toPath());
byte[] testFormAsByte2 = Files.readAllBytes(testForm.toPath());
PDDocument pdf1 = PDDocument.load(testFormAsByte);
PDAcroForm acroform = pdf1.getDocumentCatalog().getAcroForm();
acroform.flatten();
Path flattendedPdf = Files.createTempFile("flatten", ".pdf");
pdf1.save(flattendedPdf.toFile());
pdf1.close(); // release the document once the flattened copy is written
PDFMergerUtility merger = new PDFMergerUtility();
merger.addSource(new ByteArrayInputStream(Files.readAllBytes(flattendedPdf)));
merger.addSource(new ByteArrayInputStream(testFormAsByte2));
merger.setDestinationFileName("./build/flattenAndMerge.pdf");
merger.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());
}
I am using PDFBox 2.0.8.
This is the input file: https://ufile.io/6etxp
Here is the result of the test: https://ufile.io/bh94n
As far as I can see, the problem only occurs with checkboxes; normal text fields are removed correctly.

As already mentioned in a comment:
Indeed, this is a bug. It is not, though, as the OP has assumed, that flattening does not remove the fields; it is a problem in the merging code in PDFMergerUtility.mergeAcroForm.
The underlying problem is in the handling of non-trivial field hierarchies: in the sample source document shared by the OP, the checkbox fields are not top-level fields but are located under the top-level node "cb_a".
In the merged document they are not only renamed but also added to the list of top-level form fields; this is not valid, as they still have a Parent reference to "cb_a".
This bug is currently being discussed and resolved in the context of the Apache Jira entry PDFBOX-4066.
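To make the invalid state concrete, here is a minimal plain-Java sketch of the consistency rule described above. It deliberately does not use PDFBox classes: `Field`, `invalidTopLevel` and the child name `cb_1_dummy123` are hypothetical and only model a field dictionary with a partial name and a Parent reference.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of an AcroForm field list; NOT PDFBox API.
class FieldHierarchyCheck {

    // Hypothetical stand-in for a field dictionary with /T and /Parent.
    static final class Field {
        final String partialName;
        final Field parent; // null for a genuine top-level field

        Field(String partialName, Field parent) {
            this.partialName = partialName;
            this.parent = parent;
        }
    }

    // Returns the partial names of entries that sit in the top-level
    // list although they still point at a parent field.
    static List<String> invalidTopLevel(List<Field> topLevelFields) {
        List<String> invalid = new ArrayList<>();
        for (Field field : topLevelFields) {
            if (field.parent != null) {
                invalid.add(field.partialName);
            }
        }
        return invalid;
    }

    public static void main(String[] args) {
        Field cbA = new Field("cb_a", null);
        // Assumed name: a renamed checkbox that still references cb_a.
        Field renamedChild = new Field("cb_1_dummy123", cbA);
        // The merged document lists both at top level, which is inconsistent.
        System.out.println(invalidTopLevel(Arrays.asList(cbA, renamedChild)));
    }
}
```

A check like this only detects the symptom; the actual fix belongs in the merging code, as tracked in the Jira entry.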

Related

Apache PDFBox - Not able to read all fields from PDF

We are trying to read a PDF and populate values in it dynamically. Based on an incoming request we run some rules to derive which PDF to use and then populate values into it dynamically. We are using Apache PDFBox version 2.0.11, and for some reason we are facing issues with one particular PDF template: we are not able to read some of its fields, so the generated PDF is incomplete. We wonder whether this has something to do with the original PDF itself. Here is the code snippet we use to read the fields and populate them.
PDDocument pdfTemplate = PDDocument.load(inputStream);
PDDocumentCatalog docCatalog = pdfTemplate.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
acroForm.setXFA(null);
COSArrayList<PDField> list = (COSArrayList<PDField>) acroForm.getFields();
for (PDField field : list) {
field.setReadOnly(true);
logger.debug("Field name " + field.getFullyQualifiedName());
//use logic to populate value by calling field.setValue();
}
When we printed each field name, we observed that more than 30 percent of the fields are missing. Can anyone help with how to fix this? The PDF has 15 pages with different questions. If the issue is with the original PDF itself, what might be the reason that some of the fields cannot be read?
You probably have hierarchical fields on that form. Try something like the code below instead...
PDDocument pdfTemplate = PDDocument.load(inputStream);
PDDocumentCatalog docCatalog = pdfTemplate.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
PDFieldTree fieldTree = acroForm.getFieldTree();
Iterator<PDField> fieldTreeIterator = fieldTree.iterator();
while (fieldTreeIterator.hasNext()) {
PDField field = fieldTreeIterator.next();
if (field instanceof PDTerminalField) {
String fullyQualifiedName = field.getFullyQualifiedName();
logger.debug("Field name "+fullyQualifiedName);
}
}
PDAcroForm.getFields() only gets the root fields, not their children. PDAcroForm.getFieldTree() gets all fields but then you need to test to see if they're terminal before setting a value. Non-terminal fields can't have a value and don't have widgets (representations on the page) associated with them. You'll know this is the problem if the fully qualified name has periods in it. The periods represent the hierarchy.
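The relationship between the hierarchy and the periods in the name can be sketched without PDFBox. The following plain-Java model is an illustration only: `Node`, `terminalNames` and the sample field names are hypothetical, and it simplifies real PDFs by treating every field without child fields as terminal (in an actual file, a terminal field's Kids may be widget annotations).

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of a field tree; NOT PDFBox API.
class FieldTreeSketch {

    static final class Node {
        final String partialName;
        final Node parent;
        final List<Node> kids = new ArrayList<>();

        Node(String partialName, Node parent) {
            this.partialName = partialName;
            this.parent = parent;
            if (parent != null) {
                parent.kids.add(this);
            }
        }

        // Simplification: a field with no child fields is terminal.
        boolean isTerminal() {
            return kids.isEmpty();
        }

        // Partial names of all ancestors joined by periods.
        String fullyQualifiedName() {
            return parent == null
                    ? partialName
                    : parent.fullyQualifiedName() + "." + partialName;
        }
    }

    // Depth-first walk that keeps only terminal fields.
    static List<String> terminalNames(List<Node> roots) {
        List<String> names = new ArrayList<>();
        for (Node root : roots) {
            collect(root, names);
        }
        return names;
    }

    private static void collect(Node node, List<String> out) {
        if (node.isTerminal()) {
            out.add(node.fullyQualifiedName());
        } else {
            for (Node kid : node.kids) {
                collect(kid, out);
            }
        }
    }

    public static void main(String[] args) {
        Node applicant = new Node("applicant", null); // non-terminal root
        new Node("first", applicant);
        new Node("last", applicant);
        Node date = new Node("date", null);           // terminal root
        List<Node> roots = new ArrayList<>();
        roots.add(applicant);
        roots.add(date);
        System.out.println(terminalNames(roots));
    }
}
```

Iterating only over the roots would report "applicant" and "date" here and miss the two fields that actually carry values, which matches the symptom in the question.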
Issue was resolved after reconstructing the whole PDF again.

Java PDFBox does not maintain the font appearance of a field if it appears several times in a PDF Form

I need to fill a PDF form dynamically from my Java web app and I found PDFBox to be really useful, except for an issue I am facing when I have multiple fields with the same name.
I have 5 fields with the same name (let's say 'wcode') in different places on a one-page PDF form. These fields have different fonts. Normally, when you fill out one field manually, the other fields automatically pick up the same value. The same thing happens when I fill it using PDFBox, except that PDFBox changes all 5 fields to the font of the first field to appear in the PDF form.
Here is the code used to fill the field.
PDDocument _pdfDocument = PDDocument.load(new File(originalPdf));
PDAcroForm acroForm = _pdfDocument.getDocumentCatalog().getAcroForm();
PDTextField myCodeField = (PDTextField) acroForm.getField("wcode");
if (myCodeField != null) {
myCodeField.setValue(my.getCode());
}
//Refresh layout && Flatten the document
acroForm.refreshAppearances();
acroForm.flatten();
_pdfDocument.save(outputFile);
I added
acroForm.refreshAppearances();
after some research but that did not change anything.
So if the first 'wcode' field to appear in the PDF form is 6pt, all the other 'wcode' fields in the rest of the PDF will be 6pt, even if I had set them to 12pt in their appearance properties.
I am using PDFBox 2.0.5
The issue has been resolved in PDFBox 2.0.6, released about a month ago.
See the comments on Jira issue 3837.

getting exception while redacting pdf using itext

I am getting the exception below while trying to redact a PDF document using iText.
The issue is very sporadic: sometimes it works and sometimes it throws the error.
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.access$6100(PdfContentStreamProcessor.java:60)
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor$Do.invoke(PdfContentStreamProcessor.java:991)
at com.itextpdf.text.pdf.pdfcleanup.PdfCleanUpContentOperator.invoke(PdfCleanUpContentOperator.java:140)
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.invokeOperator(PdfContentStreamProcessor.java:286)
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.processContent(PdfContentStreamProcessor.java:425)
at com.itextpdf.text.pdf.pdfcleanup.PdfCleanUpProcessor.cleanUpPage(PdfCleanUpProcessor.java:160)
at com.itextpdf.text.pdf.pdfcleanup.PdfCleanUpProcessor.cleanUp(PdfCleanUpProcessor.java:135)
at RedactionClass.tgestRedactJavishsInput(RedactionClass.java:56)
at RedactionClass.main(RedactionClass.java:23)
The code I am using to redact is below:
public static void testRedact() throws IOException, DocumentException {
InputStream resource = new FileInputStream("D:/itext/edited_120192824_5 (1).pdf");
OutputStream result = new FileOutputStream(new File(OUTPUTDIR,
"aviteshs.pdf"));
PdfReader reader = new PdfReader(resource);
PdfStamper stamper = new PdfStamper(reader, result);
int pageCount = reader.getNumberOfPages();
Rectangle linkLocation1 = new Rectangle(440f, 700f, 470f, 710f);
Rectangle linkLocation2 = new Rectangle(308f, 205f, 338f, 215f);
Rectangle linkLocation3 = new Rectangle(90f, 155f, 130f, 165f);
List<PdfCleanUpLocation> cleanUpLocations = new ArrayList<PdfCleanUpLocation>();
for (int currentPage = 1; currentPage <= pageCount; currentPage++) {
if (currentPage == 1) {
cleanUpLocations.add(new PdfCleanUpLocation(currentPage,
linkLocation1, BaseColor.BLACK));
cleanUpLocations.add(new PdfCleanUpLocation(currentPage,
linkLocation2, BaseColor.BLACK));
cleanUpLocations.add(new PdfCleanUpLocation(currentPage,
linkLocation3, BaseColor.BLACK));
} else {
cleanUpLocations.add(new PdfCleanUpLocation(currentPage,
linkLocation1, BaseColor.BLACK));
}
}
PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(cleanUpLocations,
stamper);
try {
cleaner.cleanUp();
} catch (Exception e) {
e.printStackTrace();
}
stamper.close();
reader.close();
}
Because it is a customer document I am unable to share it; I am trying to find some test data for the same issue.
Please find the document here:
https://drive.google.com/file/d/0B-zalNTEeIOwM1JJVWctcW8ydU0/view?usp=drivesdk
In short: The cause of the NullPointerException here is that iText does not support form XObject resource inheritance from the page they are displayed on. According to the PDF specification this construct is obsolete but it can be encountered in PDFs obeying early PDF references instead of the specification.
The cause
Page 1 of the document in question contains 4 XObject resources named I1, M0, P1, and Q0.
Q0 in particular has no Resources dictionary of its own. But its last instructions are
q
413 0 0 125 75 3086 cm
/I1 Do
Q
That is, it references a resource I1.
Now iText in case of form XObjects assumes that the resources their contents reference are contained in their own Resources dictionary.
The result: iText accesses a null dictionary and a NullPointerException occurs.
The specification
The PDF specification ISO 32000-1 specifies:
A resource dictionary shall be associated with a content stream in one of the following ways:
For a content stream that is the value of a page’s Contents entry (or is an element of an array that is the value of that entry), the resource dictionary shall be designated by the page dictionary’s Resources or is inherited, as described under 7.7.3.4, "Inheritance of Page Attributes," from some ancestor node of the page object.
For other content streams, a conforming writer shall include a Resources entry in the stream's dictionary specifying the resource dictionary which contains all the resources used by that content stream. This shall apply to content streams that define form XObjects, patterns, Type 3 fonts, and annotation.
PDF files written obeying earlier versions of PDF may have omitted the Resources entry in all form XObjects and Type 3 fonts used on a page. All resources that are referenced from those forms and fonts shall be inherited from the resource dictionary of the page on which they are used. This construct is obsolete and should not be used by conforming writers.
(ISO 32000-1, section 7.8.3 - Resource Dictionaries)
Thus, in the case at hand we are in the situation of the obsolete option three, Q0 references the XObject I1 defined in the resource dictionary of the page Q0 is used for.
The document in question has a version header claiming PDF 1.5 conformance (in contrast to PDF 1.7 of the PDF specification). So let's look at the PDF Reference 1.5. The paragraph there corresponding to option three is:
A form XObject or a Type 3 font’s glyph description may omit the Resources
entry, in which case resources will be looked up in the Resources entry of the
page on which the form or font is used. This practice is not recommended.
Summarized, therefore, the PDF in question uses a construct which the PDF specification (published in 2008, in use for nine years!) calls obsolete and even the PDF Reference the file claims conformance to recommends against. iText, on the other hand, does not support this obsolete construct.
Ideas how to fix this
Essentially the PDF Cleanup code must be extended to
remember the resources of the current page in the PdfCleanUpProcessor and
use these current page resources in the PdfCleanUpContentOperator method invoke in case of a Do operator referring to form XObject without own resources.
Unfortunately some members used in invoke are private. Thus, one has to either copy the PdfCleanUp code or fall back on reflection.
(iText 5.5.12-SNAPSHOT)
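The lookup rule behind that fix can be sketched in plain Java. This is not iText code: `resolveXObject` and the map-based resource dictionaries are hypothetical and only model the rule that a form XObject without its own Resources entry inherits the resources of the page it is used on.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java model of resource lookup for a Do operator; NOT iText API.
class ResourceLookupSketch {

    // formResources is null when the form XObject omits its Resources
    // entry (the obsolete construct); then the page resources apply.
    static String resolveXObject(String name,
                                 Map<String, String> formResources,
                                 Map<String, String> pageResources) {
        Map<String, String> effective =
                formResources != null ? formResources : pageResources;
        return effective.get(name);
    }

    public static void main(String[] args) {
        // Page resources as described above: I1, M0, P1 and Q0.
        Map<String, String> page = new HashMap<>();
        page.put("I1", "image I1");
        page.put("M0", "XObject M0");
        page.put("P1", "XObject P1");
        page.put("Q0", "form Q0");

        // Q0 has no Resources dictionary, so "/I1 Do" inside Q0
        // is resolved against the page resources.
        System.out.println(resolveXObject("I1", null, page));
    }
}
```

The fallback is applied only when the form has no Resources dictionary at all; a form that has one but lacks the referenced name is treated as a plain miss.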
iText 7
The iText 7 PDF CleanUp tool also runs into an issue with your PDF; here the exception is an IllegalStateException claiming "Graphics state is always deleted after event dispatching. If you want to preserve it in renderer info, use preserveGraphicsState method after receiving renderer info."
As this exception is thrown during event dispatching, this error message does not make sense. Unfortunately the PDF CleanUp tool has become closed source in iText 7, so it is not so easy pinpointing the issue.
(iText 7.0.3-SNAPSHOT; PDF CleanUp 1.0.2-SNAPSHOT)

Problems scanning the 1042-S 2015 PDF file with iText

I am building a program that will automatically write into a PDF file. I am using the iText library to do so.
Well, to check the names of the fields I run this small piece of code:
public static void main(String[] args) throws IOException {
PdfReader reader = new PdfReader(PDF_PATH);
AcroFields fields = reader.getAcroFields();
Set<String> fldNames = fields.getFields().keySet();
for (String fldName : fldNames) {
System.out.println( fldName + ": " + fields.getField(fldName));
}
}
The output is something like:
topmostSubform[0].CopyA[0].Group12-13[0].Line13d-g[0].Line13e[0]: 13e
topmostSubform[0].CopyB[0].Group1-11[0].Line3[0].Line3[0]: 0
topmostSubform[0].CopyE[0].Group1-11[0].Line7[0]: 7
topmostSubform[0].CopyD[0].Group14-24[0].Line16[0].Line15i[0]: 15i
topmostSubform[0].CopyE[0].Group1-11[0].Line7[0] is the field name that I am looking for, and what comes after the : is the value that I put in the original PDF to keep track of the variable name of each field.
So far so good, but I am having a problem with one specific field: field number 16. I entered the value 16 to keep track of it, but my output contains only one 16, although there should be 5 copies: CopyA, CopyB, CopyC, CopyD and CopyE. What I find is only this:
topmostSubform[0].CopyA[0].Group14-24[0].Line16[0] and when I try to write to this field using this code:
form.setField("topmostSubform[0].CopyA[0].Group14-24[0].Line16[0]", "BLA BLA BLA"); it does not work. Obviously something weird is happening with field 16.
The PDF can be Downloaded at: https://www.irs.gov/pub/irs-prior/f1042s--2015.pdf
Thank you.
The form is a hybrid XFA form (or, as I like to call such forms, an abomination). In a hybrid XFA form, the fields of the form are described twice, once using PDF syntax (pure AcroForm technology), once using XML (the XML Forms Architecture, aka XFA).
This is problematic because:
There are differences between the form functionality of AcroForm technology and that of the XML Forms Architecture.
There's always the risk that the form described in PDF syntax doesn't correspond with the form in XML syntax.
That's why I always throw away the XML syntax. See the FillHybridForm example:
public void manipulatePdf(String src, String dest) throws DocumentException, IOException {
PdfReader reader = new PdfReader(src);
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
AcroFields form = stamper.getAcroFields();
form.removeXfa();
form.setField("topmostSubform[0].CopyA[0].Group14-24[0].Line16[0]", "16");
stamper.close();
reader.close();
}
This line is the one you probably don't have in your code:
form.removeXfa();
Please read my answers to the following questions for more info:
How to check a checkbox in PDF file with the same variable name with iText and Java
How to change the text color of an AcroForm field?
Is it safe to remove XFA?
If you only have time to read one Q&A from the list above, choose the last one in the list.

How do I get Custom Format Script of Pdf form fields using iText?

I have a pdf form and I am writing some data to the pdf fields through code using iText & Java. I want to get Custom Format Script of the fields so that I get to know the valid inputs for the pdf fields. Thanks
Update:
OOPS,
I misinterpreted your question. I assumed you wanted to add JavaScript to an existing PDF so that the user can add valid data.
What you are looking for is a way to extract the JavaScript from an existing PDF. You already get an impression on how to do that in the answer I referred to in my previous answer. For JavaScript added to specific fields, you need to inspect the Additional Actions:
AcroFields form = stamper.getAcroFields();
AcroFields.Item fd = form.getFieldItem("married");
// Get the PDF dictionary of the YES radio button and add an additional action
PdfDictionary dictYes =
(PdfDictionary) PdfReader.getPdfObject(fd.getWidgetRef(0));
PdfDictionary yesAction = dictYes.getAsDict(PdfName.AA);
if (yesAction == null) yesAction = new PdfDictionary();
yesAction.put(new PdfName("Fo"),
PdfAction.javaScript("setReadOnly(false);", stamper.getWriter()));
dictYes.put(PdfName.AA, yesAction);
// Get the PDF dictionary of the NO radio button and add an additional action
PdfDictionary dictNo =
(PdfDictionary) PdfReader.getPdfObject(fd.getWidgetRef(1));
PdfDictionary noAction = dictNo.getAsDict(PdfName.AA);
If noAction isn't null, you'll need to examine the different values in that dictionary. For instance: the /Bl entry (if present) will give you the action that is triggered on blur, the /Fo entry (if present) will give you the action that is triggered on focus, and so on.
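For the custom format script the question asks about, the relevant entry in ISO 32000-1 is the /F (format) key of a form field's additional-actions dictionary, next to /K (keystroke), /V (validate) and /C (calculate); the script itself sits in the /JS entry of the JavaScript action. The sketch below models this with plain Java maps rather than iText classes: `scriptsByTrigger` and the nested-map layout are hypothetical, and the sample script is only an illustration. In a real file /JS can also be a stream instead of a string.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java model of an /AA dictionary; NOT iText API.
class AdditionalActionsSketch {

    // Trigger keys from ISO 32000-1 mentioned in this answer.
    static final Map<String, String> TRIGGERS = new LinkedHashMap<>();
    static {
        TRIGGERS.put("K", "keystroke");
        TRIGGERS.put("F", "format");
        TRIGGERS.put("V", "validate");
        TRIGGERS.put("C", "calculate");
        TRIGGERS.put("Fo", "focus");
        TRIGGERS.put("Bl", "blur");
    }

    // aa maps a trigger key to an action dictionary (key -> string value).
    // Returns trigger description -> JavaScript source for JavaScript actions.
    static Map<String, String> scriptsByTrigger(Map<String, Map<String, String>> aa) {
        Map<String, String> result = new LinkedHashMap<>();
        if (aa == null) {
            return result; // no additional actions on this field
        }
        for (Map.Entry<String, Map<String, String>> entry : aa.entrySet()) {
            Map<String, String> action = entry.getValue();
            String script = action.get("JS");
            if ("JavaScript".equals(action.get("S")) && script != null) {
                String trigger = TRIGGERS.getOrDefault(entry.getKey(), entry.getKey());
                result.put(trigger, script);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> formatAction = new LinkedHashMap<>();
        formatAction.put("S", "JavaScript");
        formatAction.put("JS", "AFNumber_Format(2, 0, 0, 0, \"\", true);");

        Map<String, Map<String, String>> aa = new LinkedHashMap<>();
        aa.put("F", formatAction);

        System.out.println(scriptsByTrigger(aa));
    }
}
```

With iText, the same inspection would start from the field or widget dictionary and its PdfName.AA entry, as in the snippet above; the map-based model is used here only so the logic can be read in isolation.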
If you want to get the document-level JavaScript, you need to fetch the appropriate entry from the name tree in the Catalog of the PDF document. It is hard to explain in words, but if you download RUPS and inspect the PDF, you should be able to find the different JavaScript snippets. If we look at the file created using my incorrect answer (see below), we get this:
This shows that you need something like this:
PdfDictionary catalog = reader.getCatalog();
PdfDictionary names = catalog.getAsDict(PdfName.NAMES);
PdfDictionary javascript = names.getAsDict(PdfName.JAVASCRIPT);
Once you have this javascript dictionary, you can extract all the Javascript snippets.
Incorrect answer:
I assume that you know how to write JavaScript (or more correctly ECMAScript). JavaScript in PDF is very similar to JavaScript in HTML. I assume you don't need help to write some JavaScript methods to check if input is valid.
If that is the case, you only need to know how to add the JavaScript to an existing PDF file. For instance: I have this PDF file named form_without_js.pdf to which I want to add some javascript, for instance extra.js. In extra.js, you'll find a method that sets a field to read-only as well as a method that validates a field: if the value of married is yes and there is no value for partner, an alert is shown, otherwise the form is submitted. You'll have to write similar JavaScript depending on the nature of your form and which fields you want to check.
The AddJavaScriptToForm example shows you how to add these JavaScript methods to the existing PDF, resulting in the file form_with_js.pdf.
This is done with PdfReader and PdfStamper:
PdfReader reader = new PdfReader(src);
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
// do stuff
stamper.close();
reader.close();
Where it says // do stuff, you need to do two things:
[1.] You need to add the JavaScript snippet like this:
PdfWriter writer = stamper.getWriter();
writer.addJavaScript(Utilities.readFileToString(RESOURCE));
[2.] You need to add some JavaScript to specific fields to call the custom methods you've added.
In the example, you see a case where we add JavaScript as an additional action. You also see how we add a new button with a specific action. It's up to you to decide what is needed in your specific case.
