Apache PDFBox - Not able to read all fields from PDF

We are trying to read a PDF and populate values in it dynamically. Based on an incoming request, we run some rules to decide which PDF template to use and then populate its fields. We are using Apache PDFBox version 2.0.11, and we are facing issues with one particular PDF template: we are not able to read some of its fields, so the generated PDF is incomplete. We are wondering whether this has something to do with the original PDF itself. Here is the code snippet we use to read the fields and populate them.
PDDocument pdfTemplate = PDDocument.load(inputStream);
PDDocumentCatalog docCatalog = pdfTemplate.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
acroForm.setXFA(null);
List<PDField> list = acroForm.getFields();
for (PDField field : list) {
    field.setReadOnly(true);
    logger.debug("Field name " + field.getFullyQualifiedName());
    // use logic to populate the value by calling field.setValue();
}
When we printed each field name, we observed that more than 30 percent of the fields are missing. Can anyone help with how to fix this? The PDF has 15 pages with different questions. If the issue is with the original PDF itself, what might be the reason that some of the fields cannot be read?

You probably have hierarchical fields on that form. Try something like the code below instead...
PDDocument pdfTemplate = PDDocument.load(inputStream);
PDDocumentCatalog docCatalog = pdfTemplate.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
PDFieldTree fieldTree = acroForm.getFieldTree();
Iterator<PDField> fieldTreeIterator = fieldTree.iterator();
while (fieldTreeIterator.hasNext()) {
    PDField field = fieldTreeIterator.next();
    if (field instanceof PDTerminalField) {
        String fullyQualifiedName = field.getFullyQualifiedName();
        logger.debug("Field name " + fullyQualifiedName);
    }
}
PDAcroForm.getFields() only gets the root fields, not their children. PDAcroForm.getFieldTree() gets all fields but then you need to test to see if they're terminal before setting a value. Non-terminal fields can't have a value and don't have widgets (representations on the page) associated with them. You'll know this is the problem if the fully qualified name has periods in it. The periods represent the hierarchy.
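To make the root-versus-tree distinction concrete, here is a minimal, library-free sketch. FieldNode and FieldTreeSketch are hypothetical stand-ins invented for illustration, not PDFBox classes, and in this toy model a field counts as terminal when it has no child fields. It only shows why listing the roots misses nested fields and how a depth-first walk recovers their dotted names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a form field; this is NOT a PDFBox class.
class FieldNode {
    final String partialName;
    final FieldNode parent;
    final List<FieldNode> kids = new ArrayList<>();

    FieldNode(String partialName, FieldNode parent) {
        this.partialName = partialName;
        this.parent = parent;
        if (parent != null) {
            parent.kids.add(this);
        }
    }

    // Toy rule (assumption): a field with no child fields is terminal.
    boolean isTerminal() {
        return kids.isEmpty();
    }

    // Joins the partial names from the root down to this node with periods.
    String fullyQualifiedName() {
        return parent == null ? partialName : parent.fullyQualifiedName() + "." + partialName;
    }
}

class FieldTreeSketch {
    // Depth-first walk over the whole hierarchy, keeping only terminal fields.
    static List<String> terminalNames(List<FieldNode> roots) {
        List<String> names = new ArrayList<>();
        for (FieldNode root : roots) {
            collect(root, names);
        }
        return names;
    }

    private static void collect(FieldNode node, List<String> out) {
        if (node.isTerminal()) {
            out.add(node.fullyQualifiedName());
            return;
        }
        for (FieldNode kid : node.kids) {
            collect(kid, out);
        }
    }

    // Two roots: a plain field "name" and a group "address" with two children.
    static List<FieldNode> sampleRoots() {
        FieldNode name = new FieldNode("name", null);
        FieldNode address = new FieldNode("address", null);
        new FieldNode("street", address);
        new FieldNode("city", address);
        List<FieldNode> roots = new ArrayList<>();
        roots.add(name);
        roots.add(address);
        return roots;
    }

    public static void main(String[] args) {
        List<FieldNode> roots = sampleRoots();
        System.out.println("root count: " + roots.size());
        System.out.println("terminal fields: " + terminalNames(roots));
    }
}
```

In this sample the root list has two entries but the walk finds three fillable fields, which is the same kind of gap the question describes.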

Issue was resolved after reconstructing the whole PDF again.

Related

PDFBOX 2.0.18 - How to iterate through pages of a PDF and retrieve specific fields

I'm using PDFBox to read specific fields in a PDF document. So far, I'm able to get all the information I want from a PDF containing only one page. The PDF has fields with specific names, and I can get all the fields and insert them into a database.
I use this code with the AcroForm to access the fields:
InputStream document = item.getInputStream();
pdf = PDDocument.load(new RandomAccessBufferedFileInputStream(document));
pdCatalog = pdf.getDocumentCatalog();
pdAcroForm = pdCatalog.getAcroForm();
String dateRapport = pdAcroForm.getField("import_Date01").getValueAsString();
String radioRaison = pdAcroForm.getField("NoFlight").getValueAsString();
boolean hasdata = false;
// Map the selected radio button value to a reason code
if (radioRaison.length() > 0 && !radioRaison.equals("Off")) {
    if (radioRaison.equals("NR")) {
        rvhi.setRaison(obtenirRaison(raisons, "NR"));
    } else if (radioRaison.equals("WX")) {
        rvhi.setRaison(obtenirRaison(raisons, "ME"));
    } else if (radioRaison.equals("US")) {
        rvhi.setRaison(obtenirRaison(raisons, "BR"));
    }
}
if (pdAcroForm.getField("import_Hmn0" + indexEnString).getValueAsString().length() > 0) {
    hasdata = true;
}
pdf.close();
return hasdata;
Now my problem is to do the same thing with a PDF that contains multiple identical pages with the same field names, but with different data in the fields. I would like to iterate through each page, call the same method, and retrieve the field data on each page.
I use the code below to iterate through the pages of the PDF, but I don't know how to get the fields on the current page. How do I get the AcroForm fields from the PDPage object?
PDPageTree nbPages = pdf.getPages();
if (nbPages.getCount() > 1) {
    for (PDPage page : nbPages) {
        // ???? how to get the AcroForm fields from PDPage page ????
    }
}
Thanks in advance for your responses!

There is no such thing as a list of PDField objects for the current page; an AcroForm is document wide. So the first part of your question already gets the full list of fields in the document. (12.7.1 in the PDF Specification from Adobe)
Fields can have the same fully qualified name, but then their values also have to be the same. (12.7.3.2 in the PDF Specification)
What probably happens in your document is that the partial name of the field is the same, but the fully qualified name isn't the same. The fully qualified name is formed by concatenating the name of the field and the name of the ancestor objects, as in "parent partial name"."child partial name".
So basically you'll have to use the fully qualified name to find the field, or you need to iterate over the list of fields to find all fields you have in the document.
You could find the page on which a particular field is displayed as a field uses annotations (widget annotations) to show itself on a page. These annotations do live in an Annots array on the page level. Whether there is a convenience function in pdfbox to do this easily, I don't know.
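The point about partial names versus fully qualified names can be sketched without PDFBox. The helper below is a hypothetical illustration (PartialNameIndex is not a library class, and the sample names such as "1.import_Date01" are assumed, not taken from the asker's file). It groups fully qualified names by their last segment, which makes it easy to spot fields that share a partial name but live under different parents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; works on plain strings, not on PDFBox objects.
class PartialNameIndex {
    // The partial name is the segment after the last period, or the whole name if there is none.
    static String partialName(String fullyQualifiedName) {
        int dot = fullyQualifiedName.lastIndexOf('.');
        return dot < 0 ? fullyQualifiedName : fullyQualifiedName.substring(dot + 1);
    }

    // Groups fully qualified names by partial name, preserving input order within each group.
    static Map<String, List<String>> byPartialName(List<String> fullyQualifiedNames) {
        Map<String, List<String>> index = new TreeMap<>();
        for (String name : fullyQualifiedNames) {
            index.computeIfAbsent(partialName(name), key -> new ArrayList<>()).add(name);
        }
        return index;
    }

    public static void main(String[] args) {
        // Assumed sample names for illustration only.
        List<String> names = Arrays.asList("import_Date01", "1.import_Date01", "NoFlight");
        System.out.println(byPartialName(names));
    }
}
```

In a real document the input list would come from iterating over all fields of the AcroForm and collecting each field's fully qualified name.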

Sorry for the late response...
Thank you @DavidvanDriessche. To find out how the fully qualified names are composed, I used a small function to list all fields and their child nodes, if they have any. It turns out that for the second page of the document, the page number is used as the parent partial name. For example, the first page has "fieldNameExample.fieldNameExample" as the fully qualified name and the second page has "1.fieldNameExample". So I can assume that for every subsequent page, the fully qualified name will be the page number followed by ".fieldNameExample".
Thanks everyone for your help!

PdfBox flatten pdf does not remove acroform elements

I have a PDF with a lot of AcroForm fields, and I do some manipulation on it which results in a new PDF.
So I have PDF-1 (the original) and PDF-2 (just a duplicate of PDF-1), and now I want to merge them. Both PDFs have some AcroForm fields, for example: field_a, field_2...
Before I merge them I flatten PDF-1, because I only want to keep the form fields from PDF-2. When I check the merged PDF, I can see that there are no visible fields on the pages from PDF-1 and that there are fields on the pages from PDF-2. At first glance this seems OK, but when I inspect the fields I can see that the merger has renamed all the fields from PDF-2, e.g. field_a_dummy123, field_b_dummy232 ...
It seems to me that flattening does not remove the fields, and that is why the PDFMergerUtility from PDFBox renames the fields from PDF-2, since field names need to be unique. Is there a way to completely remove the AcroForm fields of PDF-1?
@Test
public void flattenAndMerge() throws IOException {
    File testForm = new File(classLoader.getResource("./TestForm.pdf").getFile());
    byte[] testFormAsByte = Files.readAllBytes(testForm.toPath());
    byte[] testFormAsByte2 = Files.readAllBytes(testForm.toPath());
    // Flatten the first copy and save it to a temporary file
    PDDocument pdf1 = PDDocument.load(testFormAsByte);
    PDAcroForm acroform = pdf1.getDocumentCatalog().getAcroForm();
    acroform.flatten();
    Path flattendedPdf = Files.createTempFile("flatten", ".pdf");
    pdf1.save(flattendedPdf.toFile());
    // Merge the flattened copy with the untouched second copy
    PDFMergerUtility merger = new PDFMergerUtility();
    merger.addSource(new ByteArrayInputStream(Files.readAllBytes(flattendedPdf)));
    merger.addSource(new ByteArrayInputStream(testFormAsByte2));
    merger.setDestinationFileName("./build/flattenAndMerge.pdf");
    merger.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());
}
I am using PDFBox 2.0.8.
This is the input file: https://ufile.io/6etxp
Here is the result of the test: https://ufile.io/bh94n
As far as I can see, the problem only occurs with checkboxes; normal text fields are removed correctly.

As already mentioned in a comment:
Indeed, this is a bug. It is not, however, that flattening fails to remove the fields, as the OP assumed; it is a problem in the merging code in PDFMergerUtility.mergeAcroForm.
The underlying problem is the handling of non-trivial field hierarchies: in the sample source document shared by the OP, the checkbox fields are not top-level fields but are located under the top-level node "cb_a".
In the merged document they are not only renamed but also added to the list of top-level form fields; this is not valid, because they still have a Parent reference to "cb_a".
This bug is currently being discussed and resolved in the Apache Jira entry PDFBOX-4066.

How to retrieve the full name of an acrofield with pdfbox

I have a pdf with acroform fields, now I need to find the corresponding PDField Objects.
For that I am using this code
public PDField getPDFieldWithName(final String fieldname) {
    PDDocumentCatalog docCatalog = pdDocument.getDocumentCatalog();
    PDAcroForm acroForm = docCatalog.getAcroForm();
    return acroForm.getFields().stream()
            .filter(x -> x.getPartialName().equalsIgnoreCase(fieldname))
            .findFirst()
            .get();
}
This works for normal fields, but if the fields in the PDF are grouped with dots and one of those fields is requested, it does not work.
As far as I can see, PDFBox handles such groups as PDNonTerminalField. Is there an easy way to get the last child and check against it?
On my form there is a field named Test.foo.bar. When I search with the above method for a field with the name "Test.foo.bar", it does not find it:
java.util.NoSuchElementException: No value present

getFields only returns the root fields. The better solution would be to call getFieldIterator(). Or just call getField(fullyQualifiedName) if you have the full name.
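As a rough, library-free illustration of why the root-only search fails for "Test.foo.bar" while a lookup by full name succeeds, consider the sketch below. NameNode and DottedLookup are hypothetical names made up for this example and are not part of PDFBox; the sketch only assumes that a dotted name is resolved one partial name at a time from the top of the hierarchy.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical tree node keyed by partial name; NOT a PDFBox class.
class NameNode {
    final Map<String, NameNode> kids = new LinkedHashMap<>();

    // Returns the child with this partial name, creating it if needed.
    NameNode child(String partialName) {
        return kids.computeIfAbsent(partialName, key -> new NameNode());
    }
}

class DottedLookup {
    // Mirrors the question's approach: only the top-level names are compared.
    static Optional<NameNode> findAmongRoots(NameNode forest, String name) {
        return Optional.ofNullable(forest.kids.get(name));
    }

    // Splits the name on periods and descends one level per segment.
    static Optional<NameNode> resolve(NameNode forest, String fullyQualifiedName) {
        NameNode current = forest;
        for (String part : fullyQualifiedName.split("\\.")) {
            current = current.kids.get(part);
            if (current == null) {
                return Optional.empty();
            }
        }
        return Optional.of(current);
    }

    public static void main(String[] args) {
        NameNode forest = new NameNode();
        forest.child("Test").child("foo").child("bar");
        System.out.println("root-only search finds it: "
                + findAmongRoots(forest, "Test.foo.bar").isPresent());
        System.out.println("segment-by-segment search finds it: "
                + resolve(forest, "Test.foo.bar").isPresent());
    }
}
```

Returning an Optional instead of calling get() on the stream result also avoids the NoSuchElementException from the question when a name is absent.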

Java PDFBox does not maintain the font appearance of a field if it appears several times in a PDF form

I need to fill a PDF form dynamically from my Java web app, and I found PDFBox to be really useful, except for a challenge I am facing when I have multiple fields with the same name.
I have 5 fields with the same name (let's say 'wcode') in different places on a one-page PDF form. These fields have different fonts. Normally, when you fill out one field manually, the other fields automatically pick up the same value. The same thing happens when I fill it using PDFBox, except that PDFBox changes all 5 fields to the font of the first field to appear in the PDF form.
Here is the code used to fill the field.
PDDocument _pdfDocument = PDDocument.load(new File(originalPdf));
PDAcroForm acroForm = _pdfDocument.getDocumentCatalog().getAcroForm();
PDTextField myCodeField = (PDTextField) acroForm.getField("wcode");
if (myCodeField != null) {
    myCodeField.setValue(my.getCode());
}
// Refresh layout && flatten the document
acroForm.refreshAppearances();
acroForm.flatten();
_pdfDocument.save(outputFile);
I added
acroForm.refreshAppearances();
after some research but that did not change anything.
So if the first 'wcode' field to appear on the PDF form is 6pt, all the other 'wcode' fields in the rest of the PDF will be 6pt, even if I set them to 12pt in their appearance properties.
I am using PDFBox 2.0.5

The issue has been resolved in PDFBox 2.0.6, released about a month ago.
See the comments on Jira issue 3837.

Getting a NullPointerException when calling acroform.getFields() using PDFBox

I tried to get all the fields available in a PDF form, but I'm encountering a NullPointerException when calling acroform.getFields() using PDFBox.
Sample:
pdDoc = PDDocument.load(fileName);
PDAcroForm form = pdDoc.getDocumentCatalog().getAcroForm();
if (form != null) {
    List<PDField> field = form.getFields(); // here I am getting the NullPointerException
}

This is because your PDF does not contain an AcroForm.

I had this same error, and it turned out I was merely assuming all PDFs in our collection from this particular screen would have fields. It turned out that was not the case and that we had clients with certain pdfs that had no fields at all. So just add a null check to make sure AcroForm is not null and you should be good to go.
