How to get an input field's title using PDFBox - java

I want to read all fields from a PDF file and collect the data each one carries: field type, id, default value, title, popup text and so on.
I can get almost all of this except the title. If I understand correctly, the correct field titles are in the content part of the PDF, but how can I match them to the fields?
I use this code to get the field information.
try (PDDocument document = Loader.loadPDF(pdfFileBinary)) {
    PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
    if (acroForm == null) {
        return Collections.emptyList();
    }
    return acroForm.getFields()
            .stream()
            .flatMap(unfoldFunction)
            .map(PdfFieldImpl::new)
            .collect(Collectors.toList());
} catch (IOException e) {
    // Keep the original exception as the cause so its stack trace is not lost
    throw new RuntimeException("Can't parse pdf document", e);
}
If someone knows a solution that uses a different library, that would be great too.
Sorry if this is too stupid a question :)

Related

How to place foreign characters to a PDF in itextpdf like 'ő' or 'ű'?

I have a function to place my text to the document into something like a table.
private static void addCenteredParagraph(Document document, float width, String text) {
    PdfFont timesNewRomanBold = null;
    try {
        timesNewRomanBold = PdfFontFactory.createFont(StandardFonts.TIMES_BOLD);
    } catch (IOException e) {
        LOGGER.error("Failed to create Times New Roman Bold font.");
        LOGGER.error(e);
    }
    List<TabStop> tabStops = new ArrayList<>();
    // Create a TabStop at the middle of the page
    tabStops.add(new TabStop(width / 2, TabAlignment.CENTER));
    // Create a TabStop at the end of the page
    tabStops.add(new TabStop(width, TabAlignment.LEFT));
    Paragraph p = new Paragraph().addTabStops(tabStops).setFontSize(14);
    if (timesNewRomanBold != null) {
        p.setFont(timesNewRomanBold);
    }
    p.add(new Tab()).add(text).add(new Tab());
    document.add(p);
}
But my problem is that the exported PDF shows empty characters instead of the letters ő, Ő, ű, Ű.
Times New Roman supports these characters, so I think I need to set the encoding to UTF-8, but I couldn't find a way to do that on Google.
Can someone please explain how to make these characters appear properly in the PDF?
I tried the following, but they either use deprecated functions, don't accept the arguments I'm trying to pass, or rely on Chunk, which I don't use.
Itext PDF writer, Is there any way to allow unicode subscript symbol in the pdf? (Without setTextRise)
How can I set encoding for iText when I insert value to placeholder in pdf form?
Edit: I figured out that timesNewRomanBold is null, even if I set it to HELVETICA or COURIER.

PDFBox inconsistent PDTextField behaviour after setValue

PDFBox setValue() is not setting the value for every PDTextField; only a few fields are saved. It fails for fields whose getFullyQualifiedName() values look similar.
Note: with field.getFullyQualifiedName() values { customdutiesa, customdutiesb, customdutiesc }, it works for customdutiesa but not for customdutiesb, customdutiesc, etc.
@Test
public void testb3Generator() throws IOException {
    File f = new File(inputFile);
    outputFile = String.format("%s_b3-3.pdf", "123");
    try (PDDocument document = PDDocument.load(f)) {
        PDDocumentCatalog catalog = document.getDocumentCatalog();
        PDAcroForm acroForm = catalog.getAcroForm();
        int i = 0;
        for (PDField field : acroForm.getFields()) {
            i = i + 1;
            if (field instanceof PDTextField) {
                PDTextField textField = (PDTextField) field;
                textField.setValue(Integer.toString(i));
            }
        }
        document.getDocumentCatalog().getAcroForm().flatten();
        document.save(new File(outputFile));
        document.close();
    }
    catch (Exception e) {
        e.printStackTrace();
    }
}
Input pdf link : https://s3-us-west-2.amazonaws.com/kx-filing-docs/b3-3.pdf
Ouput pdf link : https://kx-filing-docs.s3-us-west-2.amazonaws.com/123_b3-3.pdf
The problem is that under certain conditions PDFBox does not construct appearances for fields it sets the value of, and, therefore, during flattening completely forgets the field content:
// in case all tests fail the field will be formatted by acrobat
// when it is opened. See FreedomExpressions.pdf for an example of this.
if (actions == null || actions.getF() == null ||
    widget.getCOSObject().getDictionaryObject(COSName.AP) != null)
{
    ... generate appearance ...
}
(org.apache.pdfbox.pdmodel.interactive.form.AppearanceGeneratorHelper.setAppearanceValue(String))
I.e. if there is a JavaScript action for value formatting associated with the field and no appearance stream is yet present, PDFBox assumes it does not need to create an appearance (and probably would do it wrong anyways as it does not use that formatting action).
If the form is flattened afterwards, as in your use case, that assumption is obviously wrong.
To force PDFBox to generate appearances for those fields, too, simply remove the actions before setting field values:
if (field instanceof PDTextField) {
    PDTextField textField = (PDTextField) field;
    textField.setActions(null);
    textField.setValue(Integer.toString(i));
}
(from FillAndFlatten test testLikeAbubakarRemoveAction)

Not all fields showing in PDF when using Eclipse and itextpdf

I am using Eclipse to get data from the DB and populate a PDF. I created a form and named the fields. Everything was working fine before my computer crashed and I lost my code. I had to revert back to an older version of my code. I updated my code to populate some new fields that were added on the PDF before my computer crashed. Those new fields won't show the data.
I know the problem is not within getting the data from the DB because I can populate an older field with the new data.
My first thought is that I have to initialize the new fields but I don't know where I would do that. I did a search for a field that is showing but I only saw my code where I populate that field.
QDOB1 is an older field and is populated on the PDF.
dp.mdf("QDOB1", sChildren[1]);
QCHILD1 is a new field and won't populate.
dp.mdf("QCHILD1", sChildren[0]);
I can use sChildren[0] to populate the QDOB1 field.
dp.mdf("QDOB1", sChildren[0]);
Here is my code for dp.mdf which populates the field on the PDF.
public void mdf(String uk, String vl)
{
    // Skip null, empty and whitespace-only values.
    // (The original check vl != "" compared object references, not string contents.)
    if ((vl != null) && (vl.trim().length() != 0))
    {
        AcroFields af = this.ps.getAcroFields();
        try
        {
            af.setField(uk, vl);
        }
        catch (IOException e)
        {
            JOptionPane.showMessageDialog(null, e.toString());
            System.exit(0);
        }
        catch (DocumentException e)
        {
            JOptionPane.showMessageDialog(null, e.toString());
            System.exit(0);
        }
    }
}
Do the fields on the PDF need to be initialized? If so, where do I find the file that does this? I can't find an XML file attached to this project.
I realized the form field names are case sensitive. It now works.
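Because the names are matched case-sensitively, a lookup that tolerates differences in case can catch this kind of mistake early. The following is only a minimal sketch using the standard library: FieldNameResolver and resolve are hypothetical names, and the collection of names is assumed to have been read from the form beforehand.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Optional;

// Hypothetical helper: maps a requested field name to the exact name used in the form.
final class FieldNameResolver {

    private FieldNameResolver() {
    }

    // Returns the requested name if the form contains it exactly, otherwise the
    // first name that differs only in case, otherwise an empty Optional.
    static Optional<String> resolve(Collection<String> formFieldNames, String requested) {
        if (formFieldNames.contains(requested)) {
            return Optional.of(requested);
        }
        return formFieldNames.stream()
                .filter(name -> name.equalsIgnoreCase(requested))
                .findFirst();
    }

    public static void main(String[] args) {
        // Assumed field names; in practice they would be read from the PDF form.
        List<String> names = Arrays.asList("QDOB1", "QChild1");
        System.out.println(resolve(names, "QCHILD1").orElse("<no such field>"));
    }
}
```

The resolved name can then be passed to the code that fills the field, and an empty result can be logged instead of silently leaving the field blank.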

Manipulating acrofields with pdfbox changes encoding of checkboxes onValue

I do some AcroForm field manipulation for text fields which have parent fields. This works so far, but the form also contains some checkboxes, which should not be changed. However, when I store the manipulated PDF to disk and inspect the checkbox values, I can see that the value of cb_a.0 has changed from ÄÖÜ?ß to ?????
My further processing fails because of this unintended change. Any idea how to prevent that?
My testcase
@Test
public void changeBoxedFieldsToOne() throws IOException {
    File encodingPdfFile = new File(classLoader.getResource("./prefill/TestFormEncoding.pdf").getFile());
    byte[] encodingPdfByte = Files.readAllBytes(encodingPdfFile.toPath());
    PdfAcrofieldManipulator pdfMani = new PdfAcrofieldManipulator(encodingPdfByte);
    assertTrue(pdfMani.getTextFieldsWithMoreThan2Children().size() > 0);
    pdfMani.changeBoxedFieldsToOne();
    byte[] changedPdf = pdfMani.savePdf();
    Files.write(Paths.get("./build/changeBoxedFieldsToOne.pdf"), changedPdf);
    pdfMani = new PdfAcrofieldManipulator(changedPdf);
    assertTrue(pdfMani.getTextFieldsWithMoreThan2Children().size() == 0);
}
public void changeBoxedFieldsToOne() {
    PDDocumentCatalog docCatalog = pdDocument.getDocumentCatalog();
    PDAcroForm acroForm = docCatalog.getAcroForm();
    List<PDNonTerminalField> textFieldWithMoreThan2Childrens = getTextFieldsWithMoreThan2Children();
    for (PDField field : textFieldWithMoreThan2Childrens) {
        int amountOfChilds = ((PDNonTerminalField) field).getChildren().size();
        String currentFieldName = field.getPartialName();
        LOG.info("merging fields of fieldnam {0} to one field", currentFieldName);
        PDField firstChild = getChildWithPartialName((PDNonTerminalField) field, "0");
        if (firstChild == null ) {
            LOG.debug("found field which has a dot but starts not with 0, skipping this field");
            continue;
        }
        PDField lastChild = getChildWithPartialName((PDNonTerminalField) field, Integer.toString(amountOfChilds - 1));
        PDPage pageWhichContainsField = firstChild.getWidgets().get(0).getPage();
        try {
            removeField(pdDocument, currentFieldName);
        } catch (IOException e) {
            LOG.error("Error while removing field {0}", currentFieldName, e);
        }
        PDField newField = creatNewField(acroForm, field, firstChild, lastChild, pageWhichContainsField);
        acroForm.getFields().add(newField);
        PDAnnotationWidget newFieldWidget = createWidgetForField(newField, pageWhichContainsField, firstChild, lastChild);
        try {
            pageWhichContainsField.getAnnotations().add(newFieldWidget);
        } catch (IOException e) {
            LOG.error("error while adding new field to page");
        }
    }
}
public byte[] savePdf() throws IOException {
    try (final ByteArrayOutputStream out = new ByteArrayOutputStream()) {
        //pdDocument.saveIncremental(out);
        pdDocument.save(out);
        pdDocument.close();
        return out.toByteArray();
    }
}
I am using PDFBox 2.0.8
Here is the source PDF:https://ufile.io/gr01f or here https://www.file-upload.net/download-12928052/TestFormEncoding.pdf.html
Here the output: https://ufile.io/k8cr3 or here https://www.file-upload.net/download-12928049/changeBoxedFieldsToOne.pdf.html
This indeed is a bug in PDFBox: PDFBox cannot properly handle PDF Name objects containing bytes with values outside the US_ASCII range (in particular outside the range 0..127, and your umlauts are outside).
The first error in PDF Name handling is that PDFBox internally represents them as strings after a mixed UTF-8 / CP-1252 decoding strategy. This is wrong; according to the PDF specification, a name object is "an atomic symbol uniquely defined by a sequence of any characters (8-bit values) except null (character code 0). [...]
Ordinarily, the bytes making up the name are never treated as text to be presented to a human user or to an application external to a PDF processor. However, occasionally the need arises to treat a name object as text, such as one that represents a font name [...], a colourant name in a Separation or DeviceN colour space, or a structure type [...]
In such situations, the sequence of bytes making up the name object should be interpreted according to UTF-8, a variable-length byte-encoded representation."
Thus, it generally does not make sense to treat a name as anything else than a byte sequence. Only names used in certain contexts should be meaningful as UTF-8 encoded strings.
Furthermore, a mixed UTF-8 / CP-1252 decoding strategy, i.e. one that first tries to decode using UTF-8 and in case of failure tries again with CP-1252, can create the same string representation for different name entities, so it can indeed falsify data by making unequal names equal.
This is not the problem in your case, though; the names you used can be interpreted.
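The collision risk of such a mixed strategy can be reproduced with the standard library alone. The following is a minimal sketch, not PDFBox's actual code: MixedNameDecoding and decode are hypothetical names, and the fallback order (strict UTF-8 first, then CP-1252) is assumed from the description above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch of a "UTF-8 first, CP-1252 as fallback" decoder; not PDFBox's actual code.
final class MixedNameDecoding {

    private MixedNameDecoding() {
    }

    static String decode(byte[] nameBytes) {
        try {
            // Strict UTF-8: malformed input raises an exception instead of being replaced.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(nameBytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Fallback for byte sequences that are not valid UTF-8.
            return new String(nameBytes, Charset.forName("windows-1252"));
        }
    }

    public static void main(String[] args) {
        byte[] utf8Name = {(byte) 0xC3, (byte) 0xA4}; // "ä" encoded as UTF-8
        byte[] cp1252Name = {(byte) 0xE4};            // "ä" encoded as CP-1252
        // The byte sequences differ, so these would be two different PDF names ...
        System.out.println(Arrays.equals(utf8Name, cp1252Name));
        // ... but after decoding both are represented by the same Java string.
        System.out.println(decode(utf8Name).equals(decode(cp1252Name)));
    }
}
```

Once both names share one string representation, a lookup keyed by that string can no longer tell them apart.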
The second error, though, is that while serializing the PDF it properly encodes only those characters of the strings representing names which are in US_ASCII; everything else is replaced by '?':
public void writePDF(OutputStream output) throws IOException
{
    output.write('/');
    byte[] bytes = getName().getBytes(Charsets.US_ASCII);
    for (byte b : bytes)
    {
        [...]
    }
}
(from org.apache.pdfbox.cos.COSName.writePDF(OutputStream))
This is where your checkbox values (which internally are represented by PDF Name objects) get damaged beyond repair...
A simpler example that shows the problem is this:
PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage(page);
document.getDocumentCatalog().getCOSObject().setString(COSName.getPDFName("äöüß"), "äöüß");
document.save(new File(RESULT_FOLDER, "non-ascii-name.pdf"));
document.close();
In the result the catalog with the custom entry looks like this:
1 0 obj
<<
/Type /Catalog
/Version /1.4
/Pages 2 0 R
/#3F#3F#3F#3F <E4F6FCDF>
>>
In the name key all characters are replaced by '?' in hex encoded form (#3F) while in the string value the characters are appropriately encoded.
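The '?' replacement itself is plain Java behaviour and can be checked without PDFBox. The following is a small sketch: AsciiNameEncoding and toHashHex are hypothetical names, and the helper escapes every byte as #XX purely for illustration, which a real PDF writer does not do for regular characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Shows what String.getBytes does with characters the target charset cannot encode.
final class AsciiNameEncoding {

    private AsciiNameEncoding() {
    }

    // Illustration only: writes every byte as '#' followed by two hex digits.
    static String toHashHex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            sb.append(String.format("#%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String name = "\u00e4\u00f6\u00fc\u00df"; // the characters ä, ö, ü, ß
        // US_ASCII cannot encode these characters, so each one becomes '?' (0x3F).
        System.out.println(toHashHex(name, StandardCharsets.US_ASCII));   // #3F#3F#3F#3F
        // An 8-bit charset keeps one distinct byte per character.
        System.out.println(toHashHex(name, StandardCharsets.ISO_8859_1)); // #E4#F6#FC#DF
    }
}
```

The second line matches the bytes of the string value in the catalog above, while the first line matches the damaged name key.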
After a bit of searching I stumbled over an answer on this topic I gave almost two years ago. Back then the PDF Name object bytes were always interpreted as UTF-8 encoded which led to issues in that question.
As a consequence the issue PDFBOX-3347 was created. To resolve it the mixed UTF-8 / CP-1252 decoding strategy was introduced. As expressed above, though, I'm not a friend of that strategy.
In that Stack Overflow answer I also already discussed the problems related to the use of US_ASCII during PDF serialization, but that aspect has not yet been addressed at all.
Another related issue is PDFBOX-3519, but its resolution also was reduced to trying to fix the parsing of PDF Names, ignoring their serialization.
Yet another related issue is PDFBOX-2836.

Why is my form being flattened without calling the flattenFields method?

I am testing my method with this form https://help.adobe.com/en_US/Acrobat/9.0/Samples/interactiveform_enabled.pdf
It is being called like so:
Pdf.editForm("./src/main/resources/pdfs/interactiveform_enabled.pdf", "./src/main/resources/pdfs/FILLEDOUT.pdf");
where Pdf is just a worker class and editForm is a static method.
The editForm method looks like this:
public static int editForm(String inputPath, String outputPath) {
    try {
        PdfDocument pdf = new PdfDocument(new PdfReader(inputPath), new PdfWriter(outputPath));
        PdfAcroForm form = PdfAcroForm.getAcroForm(pdf, true);
        Map<String, PdfFormField> m = form.getFormFields();
        for (String s : m.keySet()) {
            if (s.equals("Name_First")) {
                m.get(s).setValue("Tristan");
            }
            if (s.equals("BACHELORS DEGREE")) {
                m.get(s).setValue("Off"); // On or Off
            }
            if (s.equals("Sex")) {
                m.get(s).setValue("FEMALE");
            }
            System.out.println(s);
        }
        pdf.close();
        logger.info("Completed");
    } catch (IOException e) {
        logger.error("Unable to fill form " + outputPath + "\n\t" + e);
        return 1;
    }
    return 0;
}
Unfortunately the FILLEDOUT.pdf file is no longer a form after calling this method. Am I doing something wrong?
I was using this resource for guidance. Notice how I am not calling the form.flattenFields(). If I do call that method however, I get an error of java.lang.IllegalArgumentException.
Thank you for your time.
Your form is Reader-enabled, i.e. it contains a usage rights digital signature by a key and certificate issued by Adobe to indicate to a regular Adobe Reader that it shall activate a number of additional features when operating on that very PDF.
If you stamp the file as in your original code, the existing PDF objects will get re-arranged and slightly changed. This breaks the usage rights signature, and Adobe Reader, recognizing that, disclaims "The document has been changed since it was created and use of extended features is no longer available."
If you stamp the file in append mode, though, the changes are appended to the PDF as an incremental update. Thus, the signature still correctly signs its original byte range and Adobe Reader does not complain.
To activate append mode, use StampingProperties when you create your PdfDocument:
PdfDocument pdf = new PdfDocument(new PdfReader(inputPath), new PdfWriter(outputPath), new StampingProperties().useAppendMode());
(Tested with iText 7.1.1-SNAPSHOT and Adobe Acrobat Reader DC version 2018.009.20050)
By the way, Adobe Reader does not merely check the signature, it also tries to determine whether the changes in the incremental update don't go beyond the scope of the additional features activated by the usage rights signature.
Otherwise you could simply take a small Reader-enabled PDF and in append mode replace all existing pages by your own content of choice. This of course is not in Adobe's interest...
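Why appending keeps a signature intact while rewriting breaks it can be illustrated with a toy model. This is only a sketch under strong simplifications: the "signature" is assumed to be a plain SHA-256 digest over the original file bytes, SignedRangeDemo and its methods are hypothetical names, and the real usage rights check in Adobe Reader does considerably more, as described above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Toy model: the "signature" is a SHA-256 digest over the first signedLength bytes.
final class SignedRangeDemo {

    private SignedRangeDemo() {
    }

    static byte[] digestOfRange(byte[] file, int signedLength) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            sha256.update(file, 0, signedLength);
            return sha256.digest();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    // True if the bytes covered by the original digest are unchanged in the new file.
    static boolean stillValid(byte[] original, byte[] changed) {
        if (changed.length < original.length) {
            return false;
        }
        return Arrays.equals(digestOfRange(original, original.length),
                digestOfRange(changed, original.length));
    }

    public static void main(String[] args) {
        byte[] original = "original objects".getBytes(StandardCharsets.US_ASCII);
        // Append mode: the old bytes stay in place, new data follows them.
        byte[] appended = "original objects + incremental update".getBytes(StandardCharsets.US_ASCII);
        // Normal stamping: existing objects are re-arranged and rewritten.
        byte[] rewritten = "rearranged objects + new values".getBytes(StandardCharsets.US_ASCII);
        System.out.println(stillValid(original, appended));  // true
        System.out.println(stillValid(original, rewritten)); // false
    }
}
```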
The filled in PDF is still an AcroForm, otherwise the example below would result in the same PDF twice.
public class Main {
    public static final String SRC = "src/main/resources/interactiveform_enabled.pdf";
    public static final String DEST = "results/filled_form.pdf";
    public static final String DEST2 = "results/filled_form_second_time.pdf";

    public static void main(String[] args) throws Exception {
        File file = new File(DEST);
        file.getParentFile().mkdirs();
        Main main = new Main();
        Map<String, String> data1 = new HashMap<>();
        data1.put("Name_First", "Tristan");
        data1.put("BACHELORS DEGREE", "Off");
        main.fillPdf(SRC, DEST, data1, false);
        Map<String, String> data2 = new HashMap<>();
        data2.put("Sex", "FEMALE");
        main.fillPdf(DEST, DEST2, data2, false);
    }

    private void fillPdf(String src, String dest, Map<String, String> data, boolean flatten) {
        try {
            PdfDocument pdf = new PdfDocument(new PdfReader(src), new PdfWriter(dest));
            PdfAcroForm form = PdfAcroForm.getAcroForm(pdf, true);
            //Delete print field from acroform because it is defined in the contentstream not in the formfields
            form.removeField("Print");
            Map<String, PdfFormField> m = form.getFormFields();
            for (String d : data.keySet()) {
                for (String s : m.keySet()) {
                    if (s.equals(d)) {
                        m.get(s).setValue(data.get(d));
                    }
                }
            }
            if (flatten) {
                form.flattenFields();
            }
            pdf.close();
            System.out.println("Completed");
        } catch (IOException e) {
            System.out.println("Unable to fill form " + dest + "\n\t" + e);
        }
    }
}
The issue you are facing has to do with the 'reader enabled forms'.
What it boils down to is that the PDF file that is initially fed to your program is reader enabled. Hence you can open the PDF in Adobe Reader and fill in the form. This allows Acrobat users to extend the behaviour of Adobe Reader.
Once the PDF is filled in and closed using iText it saves the PDF as 'not reader-extended'.
This makes it so that the AcroForm can still be filled using iText but when you open the PDF using Adobe Reader the extended functionality you see in the original PDF is gone. But this does not mean the form is flattened.
iText cannot make a form reader enabled. As a matter of fact, the only way to create a reader enabled form is to use Acrobat Professional. This is how Acrobat and Adobe Reader interact, and it is not something iText can imitate or solve. You can find some more info and a possible solution on this link.
The IllegalArgumentException you get when you call the form.flattenFields() method is because of the way the PDF document was constructed.
The "Print form" button should have been defined in the AcroForm, yet it is defined in the contentstream of the PDF, meaning the button in the AcroForm has an empty text value, and this is what causes the exception.
You can fix this by removing the print field from the AcroForm before you flatten.
The IllegalArgumentException issue has been fixed in iText 7.1.5.
