Accessing a COSArray for PDF fields with Apache PDFBox

I'm trying to access all form fields in a PDF file - so I can use code to fill them in - and this is as far as I've gotten:
PDDocumentCatalog pdCatalog = pdf.getDocumentCatalog();
PDAcroForm pdAcroForm = pdCatalog.getAcroForm();
List<PDField> fieldList = pdAcroForm.getFields(); // fieldList.size() = 1
PDField field = fieldList.get(0);
COSDictionary dictionary = field.getCOSObject();
System.out.println("dictionary size = " + dictionary.size());
// my attempt to iterate through fields
for ( Map.Entry<COSName,COSBase> entry : dictionary.entrySet() )
{
COSName key = entry.getKey();
COSBase val = entry.getValue();
if ( val instanceof COSArray )
{
System.out.println("COSArray size = " + ((COSArray)val).size());
}
System.out.println("key = " + key);
System.out.println("val = " + val);
}
which gives an output of:
dictionary size = 3
COSArray size = 2
key = COSName{Kids}
val = COSArray{[COSObject{110, 0}, COSObject{108, 0}]}
key = COSName{T}
val = COSString{form1[0]}
key = COSName{V}
val = COSString{}
Does anyone know how I can access the two COSObjects in the COSArray? I also don't know what the notation COSObject{x, y} means, and I can't find any documentation on it. If those objects are dictionaries or arrays, I would also like to know how to access their elements.

Call get(index) on the COSArray to get the COSObject itself (an indirect reference), or getObject(index) to get the dereferenced object that the COSObject points to.
COSObject{110, 0} shows the object number and the generation number (usually 0). Open your PDF file in a text editor such as Notepad++ and search for "110 0 obj" to find the object, or for "110 0 R" to see what references it.
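The text-editor search can also be scripted. The following is a minimal sketch using only the Java standard library; the class name PdfObjectRefs and the sample string are made up for illustration, and it assumes the relevant part of the PDF is stored as uncompressed text (objects packed into compressed object streams will not be found this way).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: locates "<num> <gen> obj" and
// "<num> <gen> R" tokens in raw, uncompressed PDF text.
final class PdfObjectRefs {

    // Tiny hand-written fragment that mimics the structure from the question.
    static final String SAMPLE =
            "1 0 obj\n<< /T (form1[0]) /Kids [110 0 R 108 0 R] >>\nendobj\n"
          + "110 0 obj\n<< /T (name) /Parent 1 0 R >>\nendobj\n";

    /** Returns the offset of the "<num> <gen> obj" definition, or -1 if absent. */
    static int findDefinition(String pdfText, int objNum, int gen) {
        // The lookbehind keeps "10 0 obj" from matching inside "110 0 obj".
        Pattern p = Pattern.compile("(?<!\\d)" + objNum + "\\s+" + gen + "\\s+obj\\b");
        Matcher m = p.matcher(pdfText);
        return m.find() ? m.start() : -1;
    }

    /** Returns the offsets of every "<num> <gen> R" indirect reference. */
    static List<Integer> findReferences(String pdfText, int objNum, int gen) {
        Pattern p = Pattern.compile("(?<!\\d)" + objNum + "\\s+" + gen + "\\s+R\\b");
        Matcher m = p.matcher(pdfText);
        List<Integer> offsets = new ArrayList<>();
        while (m.find()) {
            offsets.add(m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        System.out.println("definition at offset " + findDefinition(SAMPLE, 110, 0));
        System.out.println("referenced at offsets " + findReferences(SAMPLE, 110, 0));
    }
}
```

For anything beyond a quick look, getObject(index) on the COSArray remains the reliable route, since PDFBox resolves the reference regardless of how the object is stored in the file.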

Related

Read acrofields with variable page size within document with iText

I am using iText to add and read AcroFields, but I run into a problem when the page size varies within a document.
For example, take a PDF document with 3 pages: letter, legal, letter.
It is unable to get all the AcroFields. But if all pages are legal or all pages are letter, it works perfectly.
Here is the code I use to read the AcroFields.
String pdf = "D:\\1350211.pdf";
PdfReader reader = new PdfReader( pdf );
AcroFields fields = reader.getAcroFields();
Set<String> fldNames = fields.getFields().keySet();
List<AcrofieldModel> lists = new ArrayList<>();
for (String fldName : fldNames) {
List<FieldPosition> position = fields.getFieldPositions(fldName);
float lowerLeftX = position.get(0).position.getLeft();
float lowerLeftY = position.get(0).position.getBottom();
float upperRightX = position.get(0).position.getRight();
float upperRightY = position.get(0).position.getTop();
float fieldLength = Math.abs(upperRightX-lowerLeftX);
AcrofieldModel acrofieldModel = new AcrofieldModel(fldName, fields.getField( fldName ), "(X:"+lowerLeftX + " , Y:"+lowerLeftY +") ", fieldLength);
lists.add(acrofieldModel);
}
return lists;

PDFBox: do PDDocument and PDPage have references to one another?

Does a PDPage object contain a reference to the PDDocument to which it belongs? In other words, does a PDPage have knowledge of its PDDocument?
Somewhere in the application I have a list of PDDocuments.
These documents get merged into one new PDDocument:
PDFMergerUtility pdfMerger = new PDFMergerUtility();
PDDocument mergedPDDocument = new PDDocument();
for (PDDocument pdfDocument : documentList) {
pdfMerger.appendDocument(mergedPDDocument, pdfDocument);
}
Then this PDDocument gets split into bundles of 10:
Splitter splitter = new Splitter();
splitter.setSplitAtPage(bundleSize);
List<PDDocument> bundleList = splitter.split(mergedPDDocument);
My question is now:
if I loop over the pages of these split PDDocuments in the list, is there a way to know to which PDDocument a page originally belonged?
Also, if you have a PDPage object, can you get information from it, such as its page number?
Or can you get this some other way?

Does a PDPage object contain a reference to the PDDocument to which it belongs? In other words, does a PDPage have knowledge of its PDDocument?
Unfortunately the PDPage does not contain a reference to its parent PDDocument, but its dictionary leads to a list of the other pages in the document, which can be used to navigate between pages without a reference to the parent PDDocument.
If you have a PDPage object, can you get information from it like its page number, or can you get this via another way?
There is a workaround to get information about the position of a PDPage in the document without the PDDocument available. Each PDPage has a dictionary with information about the size of the page, resources, fonts, content, etc. One of these attributes is called Parent. It refers to a Pages node whose Kids array lists the pages, with all the information needed to create a shallow clone of each PDPage using the constructor PDPage(COSDictionary). The pages are in the correct order, so the page number can be obtained from the position of the entry in the array. This holds when all pages sit directly under a single Pages node, as in this example; a PDF may also nest Pages nodes, in which case Kids lists only the pages under the same node.
If I loop over the pages of these split PDDocuments in the list, is there a way to know to which PDDocument a page originally belonged?
Once you merge the document list into a single document, all references to the original documents are lost. You can confirm this by looking at the Parent object inside the PDPage: go to Parent > Kids > COSObject[n] > Parent and check whether the value of Parent is the same for all the elements in the array. In this example Parent is COSName {Parent} : 1781256139; for all pages.
COSName {Parent} : COSObject {
COSDictionary {
COSName {Kids} : COSArray {
COSObject {
COSDictionary {
COSName {TrimBox} : COSArray {0; 0; 612; 792;};
COSName {MediaBox} : COSArray {0; 0; 612; 792;};
COSName {CropBox} : COSArray {0; 0; 612; 792;};
COSName {Resources} : COSDictionary {
...
};
COSName {Contents} : COSObject {
...
};
COSName {Parent} : 1781256139;
COSName {StructParents} : COSInt {68};
COSName {ArtBox} : COSArray {0; 0; 612; 792; };
COSName {BleedBox} : COSArray {0; 0; 612; 792; };
COSName {Type} : COSName {Page};
}
}
...
COSName {Count} : COSInt {4};
COSName {Type} : COSName {Pages};
}
};
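Because the merged document no longer knows which source document a page came from, one option is to record each source document's page count (for example from getNumberOfPages()) before merging and to map page positions back afterwards. The following is only a sketch of that bookkeeping in plain Java, not something PDFBox provides: the class PageOrigin and its methods are hypothetical, and it assumes pages are appended in list order and that the Splitter cuts the merged document into consecutive bundles of bundleSize pages.

```java
// Hypothetical bookkeeping helper, not part of PDFBox.
final class PageOrigin {

    /**
     * Returns the 0-based index of the source document that contributed the
     * page at mergedPageIndex (0-based), given the page count of each source
     * document in the order in which the documents were appended.
     */
    static int sourceDocumentOf(int[] pageCounts, int mergedPageIndex) {
        if (mergedPageIndex < 0) {
            throw new IndexOutOfBoundsException("page " + mergedPageIndex);
        }
        int firstPageOfDoc = 0;
        for (int doc = 0; doc < pageCounts.length; doc++) {
            if (mergedPageIndex < firstPageOfDoc + pageCounts[doc]) {
                return doc;
            }
            firstPageOfDoc += pageCounts[doc];
        }
        throw new IndexOutOfBoundsException("page " + mergedPageIndex);
    }

    /** Converts a 0-based position inside a bundle back to the merged page index. */
    static int mergedIndex(int bundleIndex, int pageInBundle, int bundleSize) {
        return bundleIndex * bundleSize + pageInBundle;
    }

    public static void main(String[] args) {
        int[] pageCounts = {3, 5, 2};          // assumed counts, recorded before merging
        int merged = mergedIndex(0, 4, 10);    // fifth page of the first bundle
        System.out.println("source document index: " + sourceDocumentOf(pageCounts, merged));
    }
}
```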
Source code
I wrote the following code to show how the information from the PDPage dictionary can be used to navigate the pages back and forward and get the page number using the position in the array.
public class PDPageUtils {
public static void main(String[] args) throws InvalidPasswordException, IOException {
System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
PDDocument document = null;
try {
String filename = "src/main/resources/pdf/us-017.pdf";
document = PDDocument.load(new File(filename));
System.out.println("listIterator(PDPage)");
ListIterator<PDPage> pageIterator = listIterator(document.getPage(0));
while (pageIterator.hasNext()) {
PDPage page = pageIterator.next();
System.out.println("page #: " + pageIterator.previousIndex() + ", Structural Parent Key: " + page.getStructParents());
}
} finally {
if (document != null) {
document.close();
}
}
}
/**
* Returns a <code>ListIterator</code> initialized with the list of pages from
* the dictionary embedded in the specified <code>PDPage</code>. The current
* position of this <code>ListIterator</code> is set to the position of the
* specified <code>PDPage</code>.
*
* @param page the specified <code>PDPage</code>
*
* @see {@link java.util.ListIterator}
* @see {@link org.apache.pdfbox.pdmodel.PDPage}
*/
public static ListIterator<PDPage> listIterator(PDPage page) {
List<PDPage> pages = new LinkedList<PDPage>();
COSDictionary pageDictionary = page.getCOSObject();
COSDictionary parentDictionary = pageDictionary.getCOSDictionary(COSName.PARENT);
COSArray kidsArray = parentDictionary.getCOSArray(COSName.KIDS);
List<? extends COSBase> kidList = kidsArray.toList();
for (COSBase kid : kidList) {
if (kid instanceof COSObject) {
COSObject kidObject = (COSObject) kid;
COSBase type = kidObject.getDictionaryObject(COSName.TYPE);
if (type == COSName.PAGE) {
COSBase kidPageBase = kidObject.getObject();
if (kidPageBase instanceof COSDictionary) {
COSDictionary kidPageDictionary = (COSDictionary) kidPageBase;
pages.add(new PDPage(kidPageDictionary));
}
}
}
}
int index = pages.indexOf(page);
return pages.listIterator(index);
}
}
Sample output
In this example the PDF document has 4 pages and the iterator was initialized with the first page. Notice that, after calling next(), the page number is given by previousIndex().
System.out.println("listIterator(PDPage)");
ListIterator<PDPage> pageIterator = listIterator(document.getPage(0));
while (pageIterator.hasNext()) {
PDPage page = pageIterator.next();
System.out.println("page #: " + pageIterator.previousIndex() + ", Structural Parent Key: " + page.getStructParents());
}
listIterator(PDPage)
page #: 0, Structural Parent Key: 68
page #: 1, Structural Parent Key: 69
page #: 2, Structural Parent Key: 70
page #: 3, Structural Parent Key: 71
You can also navigate backwards by starting from the last page. Notice now that the page number is the nextIndex().
ListIterator<PDPage> pageIterator = listIterator(document.getPage(3));
pageIterator.next();
while (pageIterator.hasPrevious()) {
PDPage page = pageIterator.previous();
System.out.println("page #: " + pageIterator.nextIndex() + ", Structural Parent Key: " + page.getStructParents());
}
listIterator(PDPage)
page #: 3, Structural Parent Key: 71
page #: 2, Structural Parent Key: 70
page #: 1, Structural Parent Key: 69
page #: 0, Structural Parent Key: 68
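The choice between previousIndex() and nextIndex() in the two loops follows from how java.util.ListIterator moves its cursor, independent of PDFBox. Here is a small standard-library sketch, using a list of strings as a stand-in for the pages; the class and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.ListIterator;

// Illustrative only: shows which index call names the element just returned.
final class ListIteratorIndexDemo {

    /** Walks forward from 'start' and records the index of each element returned by next(). */
    static List<Integer> forwardIndexes(List<String> items, int start) {
        List<Integer> result = new ArrayList<>();
        ListIterator<String> it = items.listIterator(start);
        while (it.hasNext()) {
            it.next();
            // The cursor now sits after the element, so previousIndex() is its index.
            result.add(it.previousIndex());
        }
        return result;
    }

    /** Walks backward from 'start' and records the index of each element returned by previous(). */
    static List<Integer> backwardIndexes(List<String> items, int start) {
        List<Integer> result = new ArrayList<>();
        ListIterator<String> it = items.listIterator(start);
        while (it.hasPrevious()) {
            it.previous();
            // The cursor now sits before the element, so nextIndex() is its index.
            result.add(it.nextIndex());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList("p0", "p1", "p2", "p3");
        System.out.println(forwardIndexes(pages, 0));              // [0, 1, 2, 3]
        System.out.println(backwardIndexes(pages, pages.size()));  // [3, 2, 1, 0]
    }
}
```

Starting the backward walk at items.size() places the cursor after the last element, which has the same effect as starting at the last index and calling next() once.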

Getting current LTV of a PDF document with iText7

I am reworking an application which uses iText5 to re-sign documents that are near expiration, with long term verification (LTV). My iText5 implementation looks like this:
//stp represents a PdfStamper instance
AcroFields fields = stp.getAcroFields();
List<String> names = fields.getSignatureNames();
boolean result = true;
if ( names.size() == 0 ) {
logger.debug("addVerification(): no signature names");
return result;
}
String sigName = names.get(names.size() - 1);
PdfPKCS7 pkcs7 = fields.verifySignature(sigName);
LtvVerification v = stp.getLtvVerification();
I now have most of the code translated into iText7. It looks like this:
SignatureUtil sign = new SignatureUtil(doc);
List<String> names = sign.getSignatureNames();
boolean result = true;
if ( names.size() == 0 ) {
logger.debug("addVerification(): no signature names");
return result;
}
String sigName = names.get(names.size() - 1);
PdfPKCS7 pkcs7 = sign.verifySignature(sigName);
LtvVerification v = stp.getLtvVerification();
LtvVerification v = doc.//what the hell do I do;
I'm stuck on the LTV signature. I have a document that is already signed and I need to get the current LTV signature. It can be done in iText5, but I didn't find a single method or class in iText7 that would return the LTV signature.
Is there a way to do this?

Using pdfbox to get form field values

I'm using PDFBox for the first time, and I am currently reading through the material on its website.
In summary, I have a PDF like this:
except that my file has many different components (text fields, radio buttons, check boxes). For this PDF I have to read these values: Mauro, Rossi, MyCompany. For now I wrote the following code:
PDDocument pdDoc = PDDocument.loadNonSeq( myFile, null );
PDDocumentCatalog pdCatalog = pdDoc.getDocumentCatalog();
PDAcroForm pdAcroForm = pdCatalog.getAcroForm();
for(PDField pdField : pdAcroForm.getFields()){
System.out.println(pdField.getValue());
}
Is this a correct way to read the value inside the form component?
Any suggestion about this?
Where can I learn other things on pdfbox?

The code you have should work. If you are actually looking to do something with the values, you'll likely need to use some other methods. For example, you can get specific fields using pdAcroForm.getField(<fieldName>):
PDField firstNameField = pdAcroForm.getField("firstName");
PDField lastNameField = pdAcroForm.getField("lastName");
Note that PDField is just a base class. You can cast a field to one of its subclasses to get more interesting information from it. For example:
PDCheckbox fullTimeSalary = (PDCheckbox) pdAcroForm.getField("fullTimeSalary");
if(fullTimeSalary.isChecked()) {
log.debug("The person earns a full-time salary");
} else {
log.debug("The person does not earn a full-time salary");
}
As you suggest, you'll find more information on the Apache PDFBox website.

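A field returned by getFields() can be a non-terminal field, that is, a node that has child fields rather than a value of its own. So you need to descend through the children until you reach the terminal fields, and then you can get the value. The code snippet below walks through all the fields and outputs the field names and values.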
{
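//from your original code (loadNonSeq is the PDFBox 1.8 API; PDNonTerminalField below
//needs PDFBox 2.x, where the equivalent call is PDDocument.load(myFile))
PDDocument pdDoc = PDDocument.loadNonSeq( myFile, null );
PDDocumentCatalog pdCatalog = pdDoc.getDocumentCatalog();
PDAcroForm pdAcroForm = pdCatalog.getAcroForm();
//get all fields in form
List<PDField> fields = pdAcroForm.getFields();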
System.out.println(fields.size() + " top-level fields were found on the form");
//inspect field values
for (PDField field : fields)
{
processField(field, "|--", field.getPartialName());
}
...
}
private void processField(PDField field, String sLevel, String sParent) throws IOException
{
String partialName = field.getPartialName();
if (field instanceof PDNonTerminalField)
{
if (!sParent.equals(field.getPartialName()))
{
if (partialName != null)
{
sParent = sParent + "." + partialName;
}
}
System.out.println(sLevel + sParent);
for (PDField child : ((PDNonTerminalField)field).getChildren())
{
processField(child, "| " + sLevel, sParent);
}
}
else
{
//field has no child. output the value
String fieldValue = field.getValueAsString();
StringBuilder outputString = new StringBuilder(sLevel);
outputString.append(sParent);
if (partialName != null)
{
outputString.append(".").append(partialName);
}
outputString.append(" = ").append(fieldValue);
outputString.append(", type=").append(field.getClass().getName());
System.out.println(outputString);
}
}

Get word position In document with lucene

I wonder how to get the position of a word in a document using Lucene.
I have already generated the index files and I want to extract some information from the index, such as the indexed words, the position of each word in the document, etc.
I created a reader like this:
public void readIndex(Directory indexDir) throws IOException {
IndexReader ir = IndexReader.open(indexDir);
Fields fields = MultiFields.getFields(ir);
System.out.println("TOTAL DOCUMENTS : " + ir.numDocs());
for(String field : fields) {
Terms terms = fields.terms(field);
TermsEnum termsEnum = terms.iterator(null);
BytesRef text;
while((text = termsEnum.next()) != null) {
System.out.println("text = " + text.utf8ToString() + "\nfrequency = " + termsEnum.totalTermFreq());
}
}
}
I modified the writer to :
org.apache.lucene.document.Document doc = new org.apache.lucene.document.Document();
FieldType fieldType = new FieldType();
fieldType.setStoreTermVectors(true);
fieldType.setStoreTermVectorPositions(true);
fieldType.setIndexed(true);
doc.add(new Field("word", new BufferedReader(new InputStreamReader(fis, "UTF-8")), fieldType));
I then checked whether the terms have positions by calling terms.hasPositions(), which returns true.
But I have no idea which function can give me the positions.

Before you try to retrieve the positional information, you have to make sure that the indexing happened with positional information enabled in the first place.
TermsEnum.docsAndPositions(...): Get DocsAndPositionsEnum for the current term. Do not call this when the enum is unpositioned. This method will return null if positions were not indexed.
