I'm using PDFBox 1.8.x to fill out an AcroForm in a PDF document. To do this, I iterate over all PDField objects in the document and set the values stored in my HashMap (aParameter).
PDAcroForm acroform = aTemplate.getDocumentCatalog().getAcroForm();
List fields = acroform.getFields();
for (int i = 0; i < fields.size(); i++)
{
    PDField field = (PDField) fields.get(i);
    String lValue = aParameter.get(field.getPartialName());
    if (lValue != null) field.setValue(lValue);
    field.setReadonly(true);
}
This works well except for my one multiline text box. In the resulting PDF, the correct text appears in the multiline text box, but without any line breaks. While searching for the reason, I found some very old answers saying that PDFBox doesn't support multiline fields and can't insert a newline character on its own. Is this still the case? Or is there a way to automatically break the text onto a new line when the width of the text box is reached?
This answer looks like a solution, but it uses a different technique (drawString) to write the PDF. Can this idea be adapted to my approach of filling out PDFields?
I'm happy about any ideas!
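To make the wrapping part of the question concrete, here is a minimal sketch of greedy word wrapping in plain Java, independent of PDFBox. The class name LineWrapper and the width function are my own assumptions for illustration. With PDFBox, the width function could plausibly be based on PDFont.getStringWidth(text) / 1000 * fontSize, but whether the field's appearance then honours the inserted line breaks in 1.8.x is something I have not verified.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

public class LineWrapper {

    /**
     * Greedy word wrap: splits text into lines no wider than maxWidth,
     * as measured by the supplied width function.
     * Explicit '\n' characters in the input always start a new line.
     * A single word wider than maxWidth is kept on a line of its own.
     */
    public static List<String> wrap(String text, double maxWidth,
                                    ToDoubleFunction<String> widthOf) {
        List<String> lines = new ArrayList<>();
        for (String paragraph : text.split("\n", -1)) {
            StringBuilder current = new StringBuilder();
            for (String word : paragraph.trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // blank paragraph: emit an empty line below
                }
                String candidate = current.length() == 0 ? word : current + " " + word;
                if (current.length() > 0 && widthOf.applyAsDouble(candidate) > maxWidth) {
                    // The next word does not fit: close the current line.
                    lines.add(current.toString());
                    current = new StringBuilder(word);
                } else {
                    current = new StringBuilder(candidate);
                }
            }
            lines.add(current.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-in metric (assumption for the demo): every character is 5 units wide.
        ToDoubleFunction<String> fixedWidth = s -> s.length() * 5.0;
        for (String line : wrap("The quick brown fox jumps over the lazy dog", 60, fixedWidth)) {
            System.out.println(line);
        }
    }
}
```

With the fixed-width stand-in, the demo splits the sentence into "The quick", "brown fox", "jumps over" and "the lazy dog". Joining the result with "\n" would give a single string that could be passed to setValue.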
I recently started working with PDType0Font (we used PDType1Font.HELVETICA before but needed Unicode support), and I'm running into a problem: I add lines to the file using PDPageContentStream, but PDFTextStripper.getText doesn't return the updated file contents.
I'm loading the font:
PDType0Font.load(document, fontFile)
And creating the contentStream as follows:
PDPageContentStream(document, pdPage, PDPageContentStream.AppendMode.PREPEND, false)
My function that adds content to the PDF is:
private fun addTextToContents(contentStream: PDPageContentStream, txtLines: List<String>, x: Float, y: Float, pdfFont: PDFont, fontSize: Float, maxWidth: Float) {
    contentStream.beginText()
    contentStream.setFont(pdfFont, fontSize)
    contentStream.newLineAtOffset(x, y)
    txtLines.forEach { txt ->
        contentStream.showText(txt)
        contentStream.newLineAtOffset(0.0F, -fontSize)
    }
    contentStream.endText()
    contentStream.close()
}
When I try to read the content of the file using PDFTextStripper.getText, I get the file as it was before the changes.
However, if I call document.save before passing the document to PDFTextStripper, it works.
val txtBefore: String = PDFTextStripper().getText(doc) // not working
doc.save(/* File */)
val txtAfter: String = PDFTextStripper().getText(doc) // working
If I use PDType1Font.HELVETICA in
contentStream.setFont(pdfFont, fontSize)
everything works without any problems, even without saving the doc before reading the text.
I suspect that the issue is the following code in PDPageContentStream.showTextInternal():
// Unicode code points to keep when subsetting
if (font.willBeSubset())
{
    int offset = 0;
    while (offset < text.length())
    {
        int codePoint = text.codePointAt(offset);
        font.addToSubset(codePoint);
        offset += Character.charCount(codePoint);
    }
}
This is the only difference between using PDType0Font with subset embedding and using PDType1Font.
Can someone help with this?
What am I doing wrong?
Your question, in particular the code you quoted, already hints at the answer:
When using a font that will be subset (font.willBeSubset() == true), the associated PDF objects are unfinished until the file is saved. Text extraction on the other hand needs the finished PDF objects to properly work. Thus, don't apply text extraction to a document that is still being created and uses fonts that will be subset.
You describe your use case as
for our unit tests, we are adding text (mandatory text for us) to the document and then using PDFTextStripper we are validating that the file has the proper fields.
As Tilman proposes: Then it would make more sense to save the PDF, and then to reload. That would be a more realistic test. Not saving is cutting corners IMHO.
Indeed, in unit tests you should first produce the final PDF as it will be sent out (i.e. saving it, either to the file system or to memory), then reload that file, and test only this reloaded document.
We are trying to read a PDF and populate values in it dynamically. Based on an incoming request, we run some rules to decide which PDF to use and then populate values in it dynamically. We are using Apache PDFBox version 2.0.11, and for some reason we are facing issues with one particular PDF template. We are not able to read some of the fields of this template, and the generated PDF is incomplete. We are wondering whether this has something to do with the original PDF itself. Here is the code snippet we use to read the fields and populate them.
PDDocument pdfTemplate = PDDocument.load(inputStream);
PDDocumentCatalog docCatalog = pdfTemplate.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
acroForm.setXFA(null);
COSArrayList<PDField> list = (COSArrayList<PDField>) acroForm.getFields();
for (PDField field : list) {
    field.setReadOnly(true);
    logger.debug("Field name " + field.getFullyQualifiedName());
    // use logic to populate value by calling field.setValue();
}
When we printed each field name, we observed that more than 30 percent of the fields are missing. Can anyone help with how to fix this? The PDF has 15 pages with different questions. If the issue is with the original PDF itself, what might be the reason that some of the fields cannot be read?
You probably have hierarchical fields on that form. Try something like the code below instead...
PDDocument pdfTemplate = PDDocument.load(inputStream);
PDDocumentCatalog docCatalog = pdfTemplate.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
PDFieldTree fieldTree = acroForm.getFieldTree();
Iterator<PDField> fieldTreeIterator = fieldTree.iterator();
while (fieldTreeIterator.hasNext()) {
    PDField field = fieldTreeIterator.next();
    if (field instanceof PDTerminalField) {
        String fullyQualifiedName = field.getFullyQualifiedName();
        logger.debug("Field name " + fullyQualifiedName);
    }
}
PDAcroForm.getFields() only gets the root fields, not their children. PDAcroForm.getFieldTree() gets all fields but then you need to test to see if they're terminal before setting a value. Non-terminal fields can't have a value and don't have widgets (representations on the page) associated with them. You'll know this is the problem if the fully qualified name has periods in it. The periods represent the hierarchy.
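To illustrate the difference between the root list and the full tree without needing a PDF at hand, here is a small sketch in plain Java. The Node class and the field names are hypothetical stand-ins, not PDFBox classes; the walk only mirrors the idea behind getFieldTree() plus the terminal-field check described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FieldTreeSketch {

    /** Hypothetical stand-in for a form field; not a PDFBox class. */
    public static class Node {
        public final String partialName;
        public final List<Node> kids = new ArrayList<>();

        public Node(String partialName, Node... kids) {
            this.partialName = partialName;
            this.kids.addAll(Arrays.asList(kids));
        }

        /** A node without children plays the role of a terminal field. */
        public boolean isTerminal() {
            return kids.isEmpty();
        }
    }

    /** Depth-first walk that collects the fully qualified names of terminal nodes only. */
    public static List<String> terminalNames(List<Node> roots) {
        List<String> names = new ArrayList<>();
        for (Node root : roots) {
            collect(root, "", names);
        }
        return names;
    }

    private static void collect(Node node, String prefix, List<String> out) {
        // The fully qualified name is the chain of partial names joined by periods.
        String fullName = prefix.isEmpty() ? node.partialName : prefix + "." + node.partialName;
        if (node.isTerminal()) {
            out.add(fullName);
        } else {
            for (Node kid : node.kids) {
                collect(kid, fullName, out);
            }
        }
    }

    public static void main(String[] args) {
        List<Node> roots = Arrays.asList(
                new Node("name"),
                new Node("address", new Node("street"), new Node("city")));
        System.out.println(roots.size());         // the root list has only 2 entries
        System.out.println(terminalNames(roots)); // the tree walk finds 3 fillable fields
    }
}
```

Iterating only over the root list in this example would visit "name" and "address" and never reach "address.street" or "address.city", which matches the symptom of a share of the fields going missing.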
The issue was resolved after rebuilding the whole PDF from scratch.
I have a PDF with AcroForm fields, and I need to find the corresponding PDField objects.
For that I am using this code:
public PDField getPDFieldWithName(final String fieldname) {
    PDDocumentCatalog docCatalog = pdDocument.getDocumentCatalog();
    PDAcroForm acroForm = docCatalog.getAcroForm();
    return acroForm.getFields().stream()
        .filter(x -> x.getPartialName().equalsIgnoreCase(fieldname))
        .findFirst()
        .get();
}
This works for normal fields, but it does not work when the fields in the PDF are grouped with dots and I look up one of those fields.
As far as I can see, PDFBox handles such fields as PDNonTerminalField. Is there an easy way to get the last child and check against it?
On my form there is a field named Test.foo.bar. When I search for a field with the name "Test.foo.bar" using the method above, it is not found:
java.util.NoSuchElementException: No value present
getFields only returns the root fields. The better solution would be to call getFieldIterator(). Or just call getField(fullyQualifiedName) if you have the full name.
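To show why the stream filter in the question cannot match, here is a sketch in plain Java of resolving a dotted name segment by segment. The Node class is a hypothetical stand-in, not a PDFBox class, and this is only a model of the lookup; in PDFBox itself you would simply call getField(fullyQualifiedName) as described above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

public class FieldLookupSketch {

    /** Hypothetical stand-in for a form field; not a PDFBox class. */
    public static class Node {
        public final String partialName;
        public final Map<String, Node> kids = new LinkedHashMap<>();

        public Node(String partialName) {
            this.partialName = partialName;
        }

        /** Adds a child and returns it, so that chains can be built fluently. */
        public Node addKid(String name) {
            Node kid = new Node(name);
            kids.put(name, kid);
            return kid;
        }
    }

    /** Resolves a dotted name such as "Test.foo.bar" one segment at a time. */
    public static Optional<Node> find(Map<String, Node> roots, String fullyQualifiedName) {
        Map<String, Node> level = roots;
        Node current = null;
        for (String segment : fullyQualifiedName.split("\\.")) {
            current = level.get(segment);
            if (current == null) {
                return Optional.empty(); // report absence instead of throwing
            }
            level = current.kids;
        }
        return Optional.ofNullable(current);
    }

    public static void main(String[] args) {
        Map<String, Node> roots = new LinkedHashMap<>();
        Node test = new Node("Test");
        roots.put("Test", test);
        test.addKid("foo").addKid("bar");

        System.out.println(find(roots, "Test.foo.bar").isPresent()); // true
        System.out.println(find(roots, "bar").isPresent());          // false
    }
}
```

In this model the root level contains only "Test", and the partial name of the deepest node is "bar", so comparing a root's partial name with "Test.foo.bar" never succeeds. Returning an Optional and checking isPresent() also avoids the NoSuchElementException that an unconditional get() raises.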
I'm trying to add form fields to an existing PDF file, but PDFBox fails with the following error: Could not find font: /Helv
My Java code looks like this:
PDDocument pdf = PDDocument.load(inputStream);
PDDocumentCatalog docCatalog = pdf.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
PDPage page = pdf.getPage(0);
PDTextField textBox = new PDTextField(acroForm);
textBox.setPartialName("SampleField");
acroForm.getFields().add(textBox);
PDAnnotationWidget widget = textBox.getWidgets().get(0);
PDRectangle rect = new PDRectangle(0, 0, 0, 0);
widget.setRectangle(rect);
widget.setPage(page);
widget.setAppearance(acroForm.getFields().get(0).getWidgets().get(0).getAppearance());
widget.setPrinted(false);
page.getAnnotations().add(widget);
acroForm.refreshAppearances();
acroForm.flatten();
pdf.save(outputStream);
pdf.close();
Do you have any ideas why the exception appears?
Here is the top of the stack trace:
java.io.IOException: Could not find font: /Helv
at org.apache.pdfbox.pdmodel.interactive.form.PDDefaultAppearanceString.processSetFont(PDDefaultAppearanceString.java:179)
at org.apache.pdfbox.pdmodel.interactive.form.PDDefaultAppearanceString.processOperator(PDDefaultAppearanceString.java:132)
at org.apache.pdfbox.pdmodel.interactive.form.PDDefaultAppearanceString.processAppearanceStringOperators(PDDefaultAppearanceString.java:108)
at org.apache.pdfbox.pdmodel.interactive.form.PDDefaultAppearanceString.<init>(PDDefaultAppearanceString.java:86)
at org.apache.pdfbox.pdmodel.interactive.form.PDVariableText.getDefaultAppearanceString(PDVariableText.java:93)
at org.apache.pdfbox.pdmodel.interactive.form.AppearanceGeneratorHelper.<init>(AppearanceGeneratorHelper.java:100)
at org.apache.pdfbox.pdmodel.interactive.form.PDTextField.constructAppearances(PDTextField.java:262)
at org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.refreshAppearances(PDAcroForm.java:368)
at com.workjam.service.impl.PDFService.fillForm(PDFService.java:85)
Here is the link for PDF: https://drive.google.com/file/d/0B2--NSDOiujoR3hOZFYteUl2UE0/view?usp=sharing
Your new text field doesn't have a default appearance, so PDFBox makes one for you (/Helv 0 Tf 0 g).
Solution 1: get it from the field you're using (this will not work with every PDF because you're making several assumptions, i.e. that there is a field and that it is a text field)
textBox.setDefaultAppearance(((PDTextField)acroForm.getFields().get(0)).getDefaultAppearance());
Solution 2: initialize the default resources:
PDResources resources = new PDResources();
resources.put(COSName.getPDFName("Helv"), PDType1Font.HELVETICA);
acroForm.setDefaultResources(resources);
See also the CreateSimpleForm.java example from the source code download.
Update: this has been fixed in 2.0.8, see issue PDFBOX-3943.
The cause is a combination of two things: neither you nor the source PDF provides a default appearance for the text field, and PDFBox applies its defaults inconsistently.
The default appearance
According to the specification, each field containing variable text (e.g. your text field) must have a DA default appearance value:
DA (string): (Required; inheritable) The default appearance string containing a sequence of valid page-content graphics or text state operators that define such properties as the field’s text size and colour.
(ISO 32000-1, Table 222 – Additional entries common to all fields containing variable text)
In addition to parent fields in the field tree, the DA value can also be inherited from the AcroForm dictionary:
DA (string): (Optional) A document-wide default value for the DA attribute of variable text fields (see 12.7.3.3, “Variable Text”).
(ISO 32000-1, Table 218 – Entries in the interactive form dictionary)
In your PDF
You don't provide a default appearance, and your PDF does not have a default in the AcroForm dictionary.
Thus, strictly speaking, it is not valid at the moment you call acroForm.refreshAppearances(). So PDFBox could reject that call due to missing information.
It works differently, though, as PDFBox provides defaults for certain AcroForm dictionary entries if they are not present, in particular
final String adobeDefaultAppearanceString = "/Helv 0 Tf 0 g ";
// DA entry is required
if (getDefaultAppearance().length() == 0)
{
    setDefaultAppearance(adobeDefaultAppearanceString);
}
Unfortunately, though, PDFBox does not ensure that the font Helv used here is in the default resources unless they are also missing completely.
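The mismatch can be made visible with a small sketch in plain Java that pulls the font name out of a default appearance string and checks it against a set of resource names. This is not PDFBox code; the class name and the resource names F1 and TiRo are assumptions made up for the example, and the parsing is deliberately naive (whitespace-separated tokens only).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

public class DefaultAppearanceCheck {

    /** Returns the font resource name (without the leading '/') used by the Tf operator, if any. */
    public static Optional<String> fontName(String da) {
        String[] tokens = da.trim().split("\\s+");
        for (int i = 2; i < tokens.length; i++) {
            // Tf takes two operands: the font resource name and the font size.
            if (tokens[i].equals("Tf") && tokens[i - 2].startsWith("/")) {
                return Optional.of(tokens[i - 2].substring(1));
            }
        }
        return Optional.empty();
    }

    /** True when the DA string names a font that the given resource names do not contain. */
    public static boolean hasMissingFont(String da, Set<String> fontResourceNames) {
        return fontName(da).map(name -> !fontResourceNames.contains(name)).orElse(false);
    }

    public static void main(String[] args) {
        String da = "/Helv 0 Tf 0 g ";
        // Hypothetical default resources that exist but do not contain Helv.
        Set<String> existingResources = new HashSet<>(Arrays.asList("F1", "TiRo"));
        System.out.println(fontName(da).orElse("(none)"));         // Helv
        System.out.println(hasMissingFont(da, existingResources)); // true
    }
}
```

A check of this kind on the real document, done before calling refreshAppearances(), would tell you whether the Helv entry has to be added to the default resources first.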
Solutions
I just saw Tilman wrote an answer here, too. You can find solutions to your issue there.
I need to fill a PDF form dynamically from my Java web app, and I found PDFBox to be really useful, except for an issue I'm facing when I have multiple fields with the same name.
I have 5 fields with the same name (let's say 'wcode') in different places on a one-page PDF form. These fields have different fonts. Normally, when you fill out one field manually, the other fields automatically pick up the same value. The same thing happens when I fill it using PDFBox, except that PDFBox changes all 5 fields to the font of the first field that appears in the PDF form.
Here is the code used to fill the field.
PDDocument _pdfDocument = PDDocument.load(new File(originalPdf));
PDAcroForm acroForm = _pdfDocument.getDocumentCatalog().getAcroForm();
PDTextField myCodeField = (PDTextField) acroForm.getField("wcode");
if (myCodeField != null) {
    myCodeField.setValue(my.getCode());
}
//Refresh layout && Flatten the document
acroForm.refreshAppearances();
acroForm.flatten();
_pdfDocument.save(outputFile);
I added
acroForm.refreshAppearances();
after some research but that did not change anything.
So if the first 'wcode' field to appear on the PDF form is 6pt, all the other 'wcode' fields in the rest of the PDF will be 6pt too, even if I had set them to 12pt in their appearance properties.
I am using PDFBox 2.0.5
The issue has been resolved in PDFBox 2.0.6, released about a month ago.
See the comment on JIRA issue PDFBOX-3837.