Getting null when calling acroform.getFields() using PDFBox - Java

I am trying to get all the fields available in a PDF form, but I get a NullPointerException when calling acroform.getFields() using PDFBox.
Sample:
pdDoc = PDDocument.load(fileName);
PDAcroForm form = pdDoc.getDocumentCatalog().getAcroForm();
if (form != null) {
    List<PDField> fields = form.getFields(); // the NullPointerException is thrown here
}

This is because your PDF does not contain an AcroForm.

I had this same error, and it turned out I was simply assuming that every PDF in our collection coming from this particular screen would have fields. That was not the case: some clients had PDFs with no fields at all. So just add a null check to make sure the AcroForm is not null and you should be good to go.

Related

Apache PDFBox - Not able to read all fields from PDF

We are trying to read a PDF and populate values in it dynamically. Based on an incoming request, we run some rules to decide which PDF to use and then populate values in it dynamically. We are using Apache PDFBox version 2.0.11, and for some reason we are facing issues with one particular PDF template. We are not able to read some of the fields for this template, so the generated PDF is incomplete. We are wondering whether this has something to do with the original PDF itself. Here is the code snippet we are using to read the fields and populate them.
PDDocument pdfTemplate = PDDocument.load(inputStream);
PDDocumentCatalog docCatalog = pdfTemplate.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
acroForm.setXFA(null);
COSArrayList<PDField> list = (COSArrayList<PDField>) acroForm.getFields();
for (PDField field : list) {
    field.setReadOnly(true);
    logger.debug("Field name " + field.getFullyQualifiedName());
    // use logic to populate the value by calling field.setValue();
}
When we printed each field name, we observed that more than 30 percent of the fields are missing. Can anyone help with how to fix this? The PDF has 15 pages with different questions. If the issue is with the original PDF itself, what might be the reason that some of the fields cannot be read?
You probably have hierarchical fields on that form. Try something like the code below instead...
PDDocument pdfTemplate = PDDocument.load(inputStream);
PDDocumentCatalog docCatalog = pdfTemplate.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
PDFieldTree fieldTree = acroForm.getFieldTree();
Iterator<PDField> fieldTreeIterator = fieldTree.iterator();
while (fieldTreeIterator.hasNext()) {
    PDField field = fieldTreeIterator.next();
    if (field instanceof PDTerminalField) {
        String fullyQualifiedName = field.getFullyQualifiedName();
        logger.debug("Field name " + fullyQualifiedName);
    }
}
PDAcroForm.getFields() only gets the root fields, not their children. PDAcroForm.getFieldTree() gets all fields but then you need to test to see if they're terminal before setting a value. Non-terminal fields can't have a value and don't have widgets (representations on the page) associated with them. You'll know this is the problem if the fully qualified name has periods in it. The periods represent the hierarchy.
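To make the difference between root fields and the full tree concrete, here is a minimal sketch using a simplified stand-in model. FieldNode and FieldTreeSketch are hypothetical classes written for illustration, not PDFBox types, and the model assumes a field is terminal when it has no child fields. It shows why listing only the roots misses nested fields and how the periods in a fully qualified name come from the hierarchy.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for a form field node. Hypothetical, NOT a PDFBox class. */
final class FieldNode {
    final String partialName;
    final FieldNode parent;
    final List<FieldNode> kids = new ArrayList<>();

    FieldNode(String partialName, FieldNode parent) {
        this.partialName = partialName;
        this.parent = parent;
        if (parent != null) {
            parent.kids.add(this); // register with the parent so the tree can be walked
        }
    }

    /** Assumption of this model: a field with no child fields is terminal. */
    boolean isTerminal() {
        return kids.isEmpty();
    }

    /** Joins the partial names from the root down to this node with periods. */
    String fullyQualifiedName() {
        return parent == null ? partialName : parent.fullyQualifiedName() + "." + partialName;
    }
}

final class FieldTreeSketch {
    /** Mirrors the "root fields only" view: children are never visited. */
    static List<String> rootNames(List<FieldNode> roots) {
        List<String> names = new ArrayList<>();
        for (FieldNode root : roots) {
            names.add(root.fullyQualifiedName());
        }
        return names;
    }

    /** Walks the whole tree depth-first and keeps only terminal fields. */
    static List<String> terminalNames(List<FieldNode> roots) {
        List<String> names = new ArrayList<>();
        for (FieldNode root : roots) {
            collect(root, names);
        }
        return names;
    }

    private static void collect(FieldNode node, List<String> out) {
        if (node.isTerminal()) {
            out.add(node.fullyQualifiedName());
            return;
        }
        for (FieldNode kid : node.kids) {
            collect(kid, out);
        }
    }
}
```

In this model, a form with a root "address" that has children "street" and "city" reports only "address" from rootNames, while terminalNames reports "address.street" and "address.city", which are the fields that can actually hold a value.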
The issue was resolved after rebuilding the whole PDF from scratch.

PdfBox flatten pdf does not remove acroform elements

I have a PDF with a lot of AcroForm fields. I do some manipulation on it, which results in a new PDF.
So I have PDF-1 (the original) and PDF-2 (just a duplicate of PDF-1), and now I want to merge them. Both PDFs have some AcroForm fields, for example: field_a, field_2...
Before I merge them I flatten PDF-1, because I only want to keep the fields from PDF-2. When I check the merged PDF, I can see that there are no visible fields on the pages from PDF-1 and that there are fields on the pages from PDF-2. At first glance this seems OK, but when I inspect the fields I can see that the merger has renamed all the fields from PDF-2, e.g. field_a_dummy123, field_b_dummy232 ...
It seems to me that flattening does not remove the fields, and that this is why the PDFMergerUtility from PDFBox renames the fields from PDF-2, because field names need to be unique. Is there a way to completely remove the AcroForm fields of PDF-1?
@Test
public void flattenAndMerge() throws IOException {
    File testForm = new File(classLoader.getResource("./TestForm.pdf").getFile());
    byte[] testFormAsByte = Files.readAllBytes(testForm.toPath());
    byte[] testFormAsByte2 = Files.readAllBytes(testForm.toPath());
    // Flatten PDF-1 and save it to a temporary file
    PDDocument pdf1 = PDDocument.load(testFormAsByte);
    PDAcroForm acroform = pdf1.getDocumentCatalog().getAcroForm();
    acroform.flatten();
    Path flattendedPdf = Files.createTempFile("flatten", ".pdf");
    pdf1.save(flattendedPdf.toFile());
    // Merge the flattened PDF-1 with the untouched PDF-2
    PDFMergerUtility merger = new PDFMergerUtility();
    merger.addSource(new ByteArrayInputStream(Files.readAllBytes(flattendedPdf)));
    merger.addSource(new ByteArrayInputStream(testFormAsByte2));
    merger.setDestinationFileName("./build/flattenAndMerge.pdf");
    merger.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());
}
I am using PDFBox 2.0.8.
This is the input file: https://ufile.io/6etxp
Here is the result of the test: https://ufile.io/bh94n
As far as I can see, the problem only occurs with checkboxes; normal text fields are removed correctly.
As already mentioned in a comment:
Indeed, this is a bug. It is not, though, as the OP assumed, that flattening does not remove the fields; it is a problem in the merging code in PDFMergerUtility.mergeAcroForm.
The underlying problem is in the handling of non-trivial field hierarchies: in the sample source document shared by the OP, the checkbox fields are not top-level fields but are located under the top-level node "cb_a".
In the merged document they are not only renamed but also added to the list of top-level form fields; this is not valid, as they still have a Parent reference to "cb_a".
This bug is currently being discussed and resolved in the context of the Apache Jira entry PDFBOX-4066.

How to retrieve the full name of an acrofield with pdfbox

I have a pdf with acroform fields, now I need to find the corresponding PDField Objects.
For that I am using this code
public PDField getPDFieldWithName(final String fieldname) {
    PDDocumentCatalog docCatalog = pdDocument.getDocumentCatalog();
    PDAcroForm acroForm = docCatalog.getAcroForm();
    return acroForm.getFields().stream()
            .filter(x -> x.getPartialName().equalsIgnoreCase(fieldname))
            .findFirst()
            .get();
}
This works for normal fields, but if the fields in the PDF are grouped with dots and one of those fields is requested, it does not work.
As far as I can see, PDFBox handles such groups as PDNonTerminalField. Is there an easy way to get the last child and check against it?
On my form there is a field named Test.foo.bar. When I search for a field with the name "Test.foo.bar" using the method above, it is not found:
java.util.NoSuchElementException: No value present
getFields only returns the root fields. The better solution would be to call getFieldIterator(). Or just call getField(fullyQualifiedName) if you have the full name.
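To illustrate why a search over the root fields cannot find "Test.foo.bar", here is a minimal sketch of resolving a dotted name by walking the hierarchy one partial name at a time. NamedField and FieldLookupSketch are hypothetical stand-ins written for illustration, not PDFBox classes, and the lookup here is case-sensitive, unlike the equalsIgnoreCase filter in the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Simplified stand-in for a form field. Hypothetical, NOT a PDFBox class. */
final class NamedField {
    final String partialName;
    final Map<String, NamedField> kids = new LinkedHashMap<>();

    NamedField(String partialName) {
        this.partialName = partialName;
    }

    /** Adds a child field and returns it, so nested levels can be chained. */
    NamedField addKid(String name) {
        NamedField kid = new NamedField(name);
        kids.put(name, kid);
        return kid;
    }
}

final class FieldLookupSketch {
    /**
     * Resolves a fully qualified name such as "Test.foo.bar" by descending
     * one level per partial name, starting at the root fields.
     */
    static Optional<NamedField> find(Map<String, NamedField> roots, String fullyQualifiedName) {
        Map<String, NamedField> level = roots;
        NamedField current = null;
        for (String part : fullyQualifiedName.split("\\.")) {
            current = level.get(part);
            if (current == null) {
                return Optional.empty(); // no field with this partial name at this level
            }
            level = current.kids;
        }
        return Optional.ofNullable(current);
    }
}
```

In this model the root map contains only the key "Test", so looking up "Test.foo.bar" directly among the roots fails, while find descends through "Test" and "foo" to reach "bar". Returning an Optional also avoids the NoSuchElementException that an unconditional get() raises when nothing matches.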

Java PDFBox does not maintain the font appearance of a field if it appears several times in a PDF form

I need to fill a PDF form dynamically from my Java web app, and I found PDFBox to be really useful, except for an issue I am facing when I have multiple fields with the same name.
I have 5 fields with the same name (let's say 'wcode') in different places on a one-page PDF form. These fields have different fonts. Normally, when you fill out one field manually, the other fields automatically pick up the same value. The same thing happens when I fill it using PDFBox, except that PDFBox changes all 5 fields to use the same font as the first of these fields to appear in the PDF form.
Here is the code used to fill the field.
PDDocument _pdfDocument = PDDocument.load(new File(originalPdf));
PDAcroForm acroForm = _pdfDocument.getDocumentCatalog().getAcroForm();
PDTextField myCodeField = (PDTextField) acroForm.getField("wcode");
if (myCodeField != null) {
    myCodeField.setValue(my.getCode());
}
// Refresh the layout and flatten the document
acroForm.refreshAppearances();
acroForm.flatten();
_pdfDocument.save(outputFile);
I added
acroForm.refreshAppearances();
after some research but that did not change anything.
So if the first 'wcode' field to appear in the PDF form is 6pt, all the other 'wcode' fields in the rest of the PDF will be 6pt, even if I had set them to 12pt in their appearance properties.
I am using PDFBox 2.0.5
The issue was resolved in PDFBox 2.0.6, released about a month ago.
See the comments on Jira issue 3837.

How to set multiline value to PDField

I'm using PDFBox 1.8.x for filling out an acroform in a PDF document. For this, I'm iterating over all PDFields in the document and setting the correct values which are stored in my HashMap (aParameter).
PDAcroForm acroform = aTemplate.getDocumentCatalog().getAcroForm();
List fields = acroform.getFields();
for (int i = 0; i < fields.size(); i++) {
    PDField field = (PDField) fields.get(i);
    String lValue = aParameter.get(field.getPartialName());
    if (lValue != null) field.setValue(lValue);
    field.setReadonly(true);
}
It works well except for my one and only multiline text box. In the resulting PDF, the correct text is in the multiline text box, but without any line breaks. While searching for the reason, I found some very old answers saying that PDFBox doesn't support multiline fields and can't insert a newline character on its own. Is this still the case? Or is there any way to break the text onto a new line automatically when the width of the text box is reached?
This answer looks like a solution, but it uses a different technique (drawString) to write to the PDF. Is it possible to adapt this idea to my approach of filling out PDFields?
I'm happy about any ideas!
