Using pdfbox to get form field values - java

I'm using PDFBox for the first time and am currently reading the material on the PDFBox website.
In short, I have a PDF with a form,
only my file has many more components of different kinds (text fields, radio buttons, check boxes). From this PDF I have to read these values: Mauro, Rossi, MyCompany. So far I have written the following code:
PDDocument pdDoc = PDDocument.loadNonSeq( myFile, null );
PDDocumentCatalog pdCatalog = pdDoc.getDocumentCatalog();
PDAcroForm pdAcroForm = pdCatalog.getAcroForm();
for(PDField pdField : pdAcroForm.getFields()){
System.out.println(pdField.getValue());
}
Is this a correct way to read the value inside the form component?
Any suggestion about this?
Where can I learn other things on pdfbox?

The code you have should work. If you are actually looking to do something with the values, you'll likely need to use some other methods. For example, you can get specific fields using pdAcroForm.getField(<fieldName>):
PDField firstNameField = pdAcroForm.getField("firstName");
PDField lastNameField = pdAcroForm.getField("lastName");
Note that PDField is just a base class. You can cast fields to its subclasses to get more specific information from them (the check box class is spelled PDCheckbox in PDFBox 1.8 and PDCheckBox in 2.x). For example:
PDCheckbox fullTimeSalary = (PDCheckbox) pdAcroForm.getField("fullTimeSalary");
if(fullTimeSalary.isChecked()) {
log.debug("The person earns a full-time salary");
} else {
log.debug("The person does not earn a full-time salary");
}
As you suggest, you'll find more information on the Apache PDFBox website.

A top-level field can be a non-terminal field, i.e. one that only groups child fields. So you need to descend through the children until you reach a terminal field, and only then can you get the value. The code snippet below walks through all the fields and outputs the field names and values.
{
//from your original code; PDNonTerminalField below is PDFBox 2.x API, where PDDocument.load replaces loadNonSeq
PDDocument pdDoc = PDDocument.load( myFile );
PDDocumentCatalog pdCatalog = pdDoc.getDocumentCatalog();
PDAcroForm pdAcroForm = pdCatalog.getAcroForm();
//get all fields in form
List<PDField> fields = pdAcroForm.getFields();
System.out.println(fields.size() + " top-level fields were found on the form");
//inspect field values
for (PDField field : fields)
{
processField(field, "|--", field.getPartialName());
}
...
}
private void processField(PDField field, String sLevel, String sParent) throws IOException
{
String partialName = field.getPartialName();
if (field instanceof PDNonTerminalField)
{
if (!sParent.equals(field.getPartialName()))
{
if (partialName != null)
{
sParent = sParent + "." + partialName;
}
}
System.out.println(sLevel + sParent);
for (PDField child : ((PDNonTerminalField)field).getChildren())
{
processField(child, "| " + sLevel, sParent);
}
}
else
{
//field has no child. output the value
String fieldValue = field.getValueAsString();
StringBuilder outputString = new StringBuilder(sLevel);
outputString.append(sParent);
if (partialName != null)
{
outputString.append(".").append(partialName);
}
outputString.append(" = ").append(fieldValue);
outputString.append(", type=").append(field.getClass().getName());
System.out.println(outputString);
}
}
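One detail of the helper above: it is seeded with the field's own partial name as sParent, so as written a terminal top-level field is printed with its name twice (name.name). Below is a minimal sketch of the same recursion that extends the name exactly once per level. FieldNode and FieldTreeSketch are hypothetical stand-ins, not PDFBox classes, so the walk can run without a PDF; with PDFBox 2.x itself, getFieldTree() and getFullyQualifiedName() (both used further down this page) should give you the same names without hand-written recursion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a form field; a node without children plays the role of a terminal field. */
final class FieldNode {
    final String partialName;
    final String value; // only meaningful for terminal nodes
    final List<FieldNode> children = new ArrayList<>();

    FieldNode(String partialName, String value) {
        this.partialName = partialName;
        this.value = value;
    }

    FieldNode add(FieldNode child) {
        children.add(child);
        return this;
    }
}

final class FieldTreeSketch {

    /** Collects "qualified.name = value" for every terminal node below the given top-level nodes. */
    static List<String> collect(List<FieldNode> topLevel) {
        List<String> out = new ArrayList<>();
        for (FieldNode node : topLevel) {
            walk(node, "", out);
        }
        return out;
    }

    private static void walk(FieldNode node, String parentName, List<String> out) {
        // Extend the parent's qualified name exactly once per level.
        String name = parentName.isEmpty() ? node.partialName : parentName + "." + node.partialName;
        if (node.children.isEmpty()) {
            out.add(name + " = " + node.value);
        } else {
            for (FieldNode child : node.children) {
                walk(child, name, out);
            }
        }
    }

    public static void main(String[] args) {
        FieldNode form = new FieldNode("form1[0]", null)
                .add(new FieldNode("firstName", "Mauro"))
                .add(new FieldNode("lastName", "Rossi"));
        FieldNode company = new FieldNode("company", "MyCompany");
        // Prints:
        // form1[0].firstName = Mauro
        // form1[0].lastName = Rossi
        // company = MyCompany
        collect(Arrays.asList(form, company)).forEach(System.out::println);
    }
}
```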

Related

Accessing a COSArray for PDF fields with Apache PDFBox

I'm trying to access all form fields in a PDF file - so I can use code to fill them in - and this is as far as I've gotten:
PDDocumentCatalog pdCatalog = pdf.getDocumentCatalog();
PDAcroForm pdAcroForm = pdCatalog.getAcroForm();
List<PDField> fieldList = pdAcroForm.getFields(); // fieldList.size() = 1
PDField field = fieldList.get(0);
COSDictionary dictionary = field.getCOSObject();
System.out.println("dictionary size = " + dictionary.size());
// my attempt to iterate through fields
for ( Map.Entry<COSName,COSBase> entry : dictionary.entrySet() )
{
COSName key = entry.getKey();
COSBase val = entry.getValue();
if ( val instanceof COSArray )
{
System.out.println("COSArray size = " + ((COSArray)val).size());
}
System.out.println("key = " + key);
System.out.println("val = " + val);
}
which gives an output of:
dictionary size = 3
COSArray size = 2
key = COSName{Kids}
val = COSArray{[COSObject{110, 0}, COSObject{108, 0}]}
key = COSName{T}
val = COSString{form1[0]}
key = COSName{V}
val = COSString{}
Does anyone know how I can access the two COSObjects in the COSArray? I also don't know what the notation COSObject{x, y} means and can't find any documentation on it. If those are dictionaries or arrays themselves, I would also like to know how to access their elements.
Call get(index) on the COSArray to get the COSObject (an indirect reference), or getObject(index) to get the dereferenced object that the COSObject points to.
In COSObject{110, 0}, 110 is the object number and 0 is the generation number (usually 0). Open your PDF file in a text editor such as Notepad++ and look for "110 0 obj" to find the object, or for "110 0 R" to see what references it.
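If you would rather not search by hand, the same lookup can be sketched in a few lines of plain Java. This is only an illustration of the text search described above, not PDFBox API: IndirectObjectScan and its find method are made-up names, and the sample string is invented. Keep in mind that objects stored inside compressed object streams (PDF 1.5 and later) will not show up in a plain text search.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IndirectObjectScan {

    /** Offsets at which "<number> <generation> <keyword>" occurs; keyword is "obj" or "R". */
    static List<Integer> find(String pdfText, int number, int generation, String keyword) {
        // (?<!\d) keeps 110 from matching inside 1110; \b keeps "R" from matching a longer word.
        Pattern pattern = Pattern.compile(
                "(?<!\\d)" + number + "\\s+" + generation + "\\s+" + Pattern.quote(keyword) + "\\b");
        List<Integer> offsets = new ArrayList<>();
        Matcher matcher = pattern.matcher(pdfText);
        while (matcher.find()) {
            offsets.add(matcher.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        // For a real file, decode the bytes as ISO-8859-1 so binary content maps 1:1 to chars:
        // new String(Files.readAllBytes(path), StandardCharsets.ISO_8859_1)
        String sample = "108 0 obj\n<< /T (field2) >>\nendobj\n"
                + "110 0 obj\n<< /T (field1) >>\nendobj\n"
                + "112 0 obj\n<< /T (form1[0]) /Kids [110 0 R 108 0 R] >>\nendobj\n";
        System.out.println("defined at offsets:    " + find(sample, 110, 0, "obj"));
        System.out.println("referenced at offsets: " + find(sample, 110, 0, "R"));
    }
}
```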

Java + MongoDB: how get a nested field value using complete path?

I have this path to a MongoDB field: main.inner.leaf, and any of these fields may be absent.
In Java, to avoid a NullPointerException, I have to write:
String leaf = "";
if (document.get("main") != null &&
document.get("main", Document.class).get("inner") != null) {
leaf = document.get("main", Document.class)
.get("inner", Document.class).getString("leaf");
}
In this simple example I set only 3 levels: main, inner and leaf but my documents are deeper.
So is there a way to avoid writing all these null checks?
Like this:
String leaf = document.getString("main.inner.leaf", "");
// "" is the default value if one of the levels doesn't exist
Or using a third party library:
String leaf = DocumentUtils.getNullCheck("main.inner.leaf", "", document);
Many thanks.
Since the intermediate attributes are optional you really have to access the leaf value in a null safe manner.
You could do this yourself using an approach like ...
if (document.containsKey("main")) {
Document _main = document.get("main", Document.class);
if (_main.containsKey("inner")) {
Document _inner = _main.get("inner", Document.class);
if (_inner.containsKey("leaf")) {
leafValue = _inner.getString("leaf");
}
}
}
Note: this could be wrapped up in a utility to make it more user friendly.
Or use a third-party library such as Commons BeanUtils.
But, you cannot avoid null safe checks since the document structure is such that the intermediate levels might be null. All you can do is to ease the burden of handling the null safety.
Here's an example test case showing both approaches:
@Test
public void readNestedDocumentsWithNullSafety() throws IllegalAccessException, NoSuchMethodException, InvocationTargetException {
Document inner = new Document("leaf", "leafValue");
Document main = new Document("inner", inner);
Document fullyPopulatedDoc = new Document("main", main);
assertThat(extractLeafValueManually(fullyPopulatedDoc), is("leafValue"));
assertThat(extractLeafValueUsingThirdPartyLibrary(fullyPopulatedDoc, "main.inner.leaf", ""), is("leafValue"));
Document emptyPopulatedDoc = new Document();
assertThat(extractLeafValueManually(emptyPopulatedDoc), is(""));
assertThat(extractLeafValueUsingThirdPartyLibrary(emptyPopulatedDoc, "main.inner.leaf", ""), is(""));
Document emptyInner = new Document();
Document partiallyPopulatedMain = new Document("inner", emptyInner);
Document partiallyPopulatedDoc = new Document("main", partiallyPopulatedMain);
assertThat(extractLeafValueManually(partiallyPopulatedDoc), is(""));
assertThat(extractLeafValueUsingThirdPartyLibrary(partiallyPopulatedDoc, "main.inner.leaf", ""), is(""));
}
private String extractLeafValueUsingThirdPartyLibrary(Document document, String path, String defaultValue) {
try {
Object value = PropertyUtils.getNestedProperty(document, path);
return value == null ? defaultValue : value.toString();
} catch (Exception ex) {
return defaultValue;
}
}
private String extractLeafValueManually(Document document) {
Document inner = getOrDefault(getOrDefault(document, "main"), "inner");
return inner.get("leaf", "");
}
private Document getOrDefault(Document document, String key) {
if (document.containsKey(key)) {
return document.get(key, Document.class);
} else {
return new Document();
}
}
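To get closer to the document.getString("main.inner.leaf", "") call asked for in the question, the manual approach can be generalised to any depth. A minimal sketch, assuming only that each level is a java.util.Map: org.bson.Document implements Map<String, Object>, so a Document should be accepted as is, but plain HashMaps are used here so the example runs without the driver. NestedPath and its getString method are made-up names, not part of any library, and field names that themselves contain dots are not handled.

```java
import java.util.HashMap;
import java.util.Map;

final class NestedPath {

    /**
     * Walks a dotted path such as "main.inner.leaf" through nested maps.
     * Returns defaultValue if any level is missing, if an intermediate
     * level is not a map, or if the leaf is not a String.
     */
    static String getString(Map<String, Object> root, String dottedPath, String defaultValue) {
        Object current = root;
        for (String key : dottedPath.split("\\.")) {
            if (!(current instanceof Map)) {
                return defaultValue; // level absent (null) or not a sub-document
            }
            current = ((Map<?, ?>) current).get(key);
        }
        return current instanceof String ? (String) current : defaultValue;
    }

    public static void main(String[] args) {
        Map<String, Object> inner = new HashMap<>();
        inner.put("leaf", "leafValue");
        Map<String, Object> main = new HashMap<>();
        main.put("inner", inner);
        Map<String, Object> document = new HashMap<>();
        document.put("main", main);

        System.out.println(getString(document, "main.inner.leaf", ""));      // prints "leafValue"
        System.out.println(getString(document, "main.missing.leaf", "n/a")); // prints "n/a"
    }
}
```

The null checks are still there, as the answer says they must be; they are just written once inside the loop instead of once per level.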

Does PDFBox allow to remove one field from AcroForm?

I am using Apache PDFBox 2.0.8 and trying to remove one field, but I cannot find a way to do it like I can with iText: PdfStamper.getAcroFields().removeField("signature3").
What I am trying to do: initially I have a template PDF with 3 digital signatures. In some cases I need just 2 signatures, so in those cases I need to remove the 3rd signature from the template. It seems I can't do that with PDFBox. The closest thing I found is flattening this field, but the problem is that if I flatten one particular PDField (not the whole form, just one field), all other signatures lose their functionality; it looks like they get flattened as well.
Here is code that does it:
PDDocument document = PDDocument.load(file);
PDDocumentCatalog documentCatalog = document.getDocumentCatalog();
PDAcroForm acroForm = documentCatalog.getAcroForm();
List<PDField> flattenList = new ArrayList<>();
for (PDField field : acroForm.getFieldTree()) {
if (field instanceof PDSignatureField && "signature3".equals(field.getFullyQualifiedName())) {
flattenList.add(field);
}
}
acroForm.flatten(flattenList, true);
document.save(dest);
document.close();
As Tilman already mentioned in a comment, PDFBox doesn't have a method to remove a field from the field tree. Nonetheless it has methods to manipulate the underlying PDF structure, so one can write such a method oneself, e.g. like this:
PDField removeField(PDDocument document, String fullFieldName) throws IOException {
PDDocumentCatalog documentCatalog = document.getDocumentCatalog();
PDAcroForm acroForm = documentCatalog.getAcroForm();
if (acroForm == null) {
System.out.println("No form defined.");
return null;
}
PDField targetField = null;
for (PDField field : acroForm.getFieldTree()) {
if (fullFieldName.equals(field.getFullyQualifiedName())) {
targetField = field;
break;
}
}
if (targetField == null) {
System.out.println("Form does not contain field with given name.");
return null;
}
PDNonTerminalField parentField = targetField.getParent();
if (parentField != null) {
List<PDField> childFields = parentField.getChildren();
boolean removed = false;
for (PDField field : childFields)
{
if (field.getCOSObject().equals(targetField.getCOSObject())) {
removed = childFields.remove(field);
parentField.setChildren(childFields);
break;
}
}
if (!removed)
System.out.println("Inconsistent form definition: Parent field does not reference the target field.");
} else {
List<PDField> rootFields = acroForm.getFields();
boolean removed = false;
for (PDField field : rootFields)
{
if (field.getCOSObject().equals(targetField.getCOSObject())) {
removed = rootFields.remove(field);
break;
}
}
if (!removed)
System.out.println("Inconsistent form definition: Root fields do not include the target field.");
}
removeWidgets(targetField);
return targetField;
}
void removeWidgets(PDField targetField) throws IOException {
if (targetField instanceof PDTerminalField) {
List<PDAnnotationWidget> widgets = ((PDTerminalField)targetField).getWidgets();
for (PDAnnotationWidget widget : widgets) {
PDPage page = widget.getPage();
if (page != null) {
List<PDAnnotation> annotations = page.getAnnotations();
boolean removed = false;
for (PDAnnotation annotation : annotations) {
if (annotation.getCOSObject().equals(widget.getCOSObject()))
{
removed = annotations.remove(annotation);
break;
}
}
if (!removed)
System.out.println("Inconsistent annotation definition: Page annotations do not include the target widget.");
} else {
System.out.println("Widget annotation does not have an associated page; cannot remove widget.");
// TODO: In this case iterate all pages and try to find and remove widget in all of them
}
}
} else if (targetField instanceof PDNonTerminalField) {
List<PDField> childFields = ((PDNonTerminalField)targetField).getChildren();
for (PDField field : childFields)
removeWidgets(field);
} else {
System.out.println("Target field is neither terminal nor non-terminal; cannot remove widgets.");
}
}
(RemoveField helper methods removeField and removeWidgets)
One can apply this to a document and field like this:
PDDocument document = PDDocument.load(SOURCE_PDF);
PDField field = removeField(document, "Signature1");
Assert.assertNotNull("Field not found", field);
document.save(TARGET_PDF);
document.close();
(RemoveField test testRemoveInvisibleSignature)
PS: I am not sure how much form related information PDFBox actually caches somewhere. Thus, I would propose not to manipulate the form information any further in the same document manipulation session, at least not without tests.
PPS: You find a TODO in the removeWidgets helper method. If the method outputs "Widget annotation does not have an associated page; cannot remove widget", you'll have to add the missing code.
Thanks to @mkl, I managed to do this with a shorter implementation using pdfbox-3.0.0-RC1. In this case, the goal was to hide a button (check):
var check = (PDPushButton) pdAcroForm.getField(name);
List<PDField> fields = pdAcroForm.getFields();
fields.removeIf(x -> x.getCOSObject().equals(check.getCOSObject()));
pdAcroForm.setFields(fields);
check.getWidgets().forEach(widget -> widget.setNoView(true));

PDFBox - How to access whole form

EDITED
I'm using some code to extract fields and data of a signed PDF. It's a signed PDF with several forms with some fields and their values.
But I'm only getting the signature field/value with the current code:
/**
* This will print all the fields from the document.
*
* @param pdfDocument The PDF to get the fields from.
*
* @throws IOException If there is an error getting the fields.
*/
public void printFields(PDDocument pdfDocument) throws IOException
{
PDDocumentCatalog docCatalog = pdfDocument.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
List<PDField> fields = acroForm.getFields();
System.out.println(fields.size() + " top-level fields were found on the form");
for (PDField field : fields)
{
processField(field, "|--", field.getPartialName());
}
}
private void processField(PDField field, String sLevel, String sParent) throws IOException
{
String partialName = field.getPartialName();
if (field instanceof PDNonTerminalField)
{
if (!sParent.equals(field.getPartialName()))
{
if (partialName != null)
{
sParent = sParent + "." + partialName;
}
}
System.out.println(sLevel + sParent);
for (PDField child : ((PDNonTerminalField)field).getChildren())
{
processField(child, "| " + sLevel, sParent);
}
}
else
{
String fieldValue = field.getValueAsString();
StringBuilder outputString = new StringBuilder(sLevel);
outputString.append(sParent);
if (partialName != null)
{
outputString.append(".").append(partialName);
}
outputString.append(" = ").append(fieldValue);
outputString.append(", type=").append(field.getClass().getName());
System.out.println(outputString);
}
}
But this is all I get:
1 top-level fields were found on the form
|--ENVELOPEID_EE81E3AFDC0143968B9D253C5DEA7A2B.ENVELOPEID_EE81E3AFDC0143968B9D253C5DEA7A2B = org.apache.pdfbox.pdmodel.interactive.digitalsignature.PDSignature@1c53fd30, type=org.apache.pdfbox.pdmodel.interactive.form.PDSignatureField
How can I get the contents of the PDF?
Regards

Get word position In document with lucene

I would like to know how to get the position of a word in a document using Lucene.
I have already generated the index files, and I want to extract information from the index such as the indexed words, the position of each word in the document, etc.
I created a reader like this:
public void readIndex(Directory indexDir) throws IOException {
IndexReader ir = IndexReader.open(indexDir);
Fields fields = MultiFields.getFields(ir);
System.out.println("TOTAL DOCUMENTS : " + ir.numDocs());
for(String field : fields) {
Terms terms = fields.terms(field);
TermsEnum termsEnum = terms.iterator(null);
BytesRef text;
while((text = termsEnum.next()) != null) {
System.out.println("text = " + text.utf8ToString() + "\nfrequency = " + termsEnum.totalTermFreq());
}
}
}
I modified the writer to :
org.apache.lucene.document.Document doc = new org.apache.lucene.document.Document();
FieldType fieldType = new FieldType();
fieldType.setStoreTermVectors(true);
fieldType.setStoreTermVectorPositions(true);
fieldType.setIndexed(true);
doc.add(new Field("word", new BufferedReader(new InputStreamReader(fis, "UTF-8")), fieldType));
I then checked whether the terms have positions by calling terms.hasPositions(), which returns true.
But I have no idea which function gives me the positions.
Before you try to retrieve the positional information, you've got to make sure that the indexing happened with the positional information enabled in the first place.
The method to look at is TermsEnum.docsAndPositions(...), which returns a DocsAndPositionsEnum. Its documentation says: Get DocsAndPositionsEnum for the current term. Do not call this when the enum is unpositioned. This method will return null if positions were not indexed.
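To make clear what "positions" are once they have been indexed, here is a toy positional index in plain Java. It is not Lucene code and does not use Lucene's API; PositionalIndexSketch is a made-up class, and splitting on whitespace is a simplifying assumption in place of a real analyzer. It only models the shape of the data: for each term and each document, the list of token positions at which the term occurs. In Lucene 4.x, as far as I know, you read the same information by calling nextPosition() on the DocsAndPositionsEnum as many times as freq() reports for the current document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class PositionalIndexSketch {

    // term -> (document id -> token positions of that term in the document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    /** Tokenizes on whitespace and records the 0-based position of every token. */
    void addDocument(int docId, String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        for (int position = 0; position < tokens.length; position++) {
            if (tokens[position].isEmpty()) {
                continue; // empty input yields a single empty token
            }
            postings.computeIfAbsent(tokens[position], t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position);
        }
    }

    /** Positions of the term in the given document, or an empty list if it does not occur. */
    List<Integer> positions(String term, int docId) {
        Map<Integer, List<Integer>> byDoc = postings.get(term.toLowerCase(Locale.ROOT));
        if (byDoc == null || !byDoc.containsKey(docId)) {
            return new ArrayList<>();
        }
        return byDoc.get(docId);
    }

    public static void main(String[] args) {
        PositionalIndexSketch index = new PositionalIndexSketch();
        index.addDocument(0, "to be or not to be");
        System.out.println("to -> " + index.positions("to", 0)); // prints "to -> [0, 4]"
        System.out.println("be -> " + index.positions("be", 0)); // prints "be -> [1, 5]"
    }
}
```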
