PDFBox: do PDDocument and PDPage have references to one another?

Does a PDPage object contain a reference to the PDDocument to which it belongs? In other words, does a PDPage have knowledge of its PDDocument?
Somewhere in the application I have a list of PDDocuments.
These documents get merged into one new PDDocument:
PDFMergerUtility pdfMerger = new PDFMergerUtility();
PDDocument mergedPDDocument = new PDDocument();
for (PDDocument pdfDocument : documentList) {
pdfMerger.appendDocument(mergedPDDocument, pdfDocument);
}
Then this PDDocument gets split into bundles of 10:
Splitter splitter = new Splitter();
splitter.setSplitAtPage(bundleSize);
List<PDDocument> bundleList = splitter.split(mergedPDDocument);
My question with this is now:
If I loop over the pages of these split PDDocuments in the list, is there a way to know to which PDDocument a page originally belonged?
Also, if you have a PDPage object, can you get information from it, such as its page number?
Or can you get this via another way?

Does a PDPage object contain a reference to the PDDocument to which it belongs? In other words, does a PDPage have knowledge of its PDDocument?
Unfortunately, a PDPage does not hold a reference to its parent PDDocument. Its dictionary does, however, point to its parent page tree node, whose Kids array lists the sibling pages, so you can navigate between pages without a reference to the PDDocument.
If you have a PDPage object, can you get information from it like its page number, or can you get this via another way?
There is a workaround to get the position of a PDPage in the document without having the PDDocument at hand. Each PDPage wraps a dictionary describing the page's size, resources, fonts, content, etc. One of its entries is called Parent. It refers to a page tree node (type Pages) whose Kids array holds the page dictionaries, and each of those can be wrapped in a shallow PDPage using the constructor PDPage(COSDictionary). The Kids entries are in page order, so the page number can be obtained from the position of the page in the array. Note that this only holds for a flat page tree: a PDF may nest intermediate Pages nodes inside Kids, in which case the position is relative to that node only.
If I loop over the pages of these split PDDocuments in the list, is there a way to know to which PDDocument a page originally belonged?
Once you merge the document list into a single document, all references to the original documents are lost. You can confirm this by inspecting the Parent object inside the PDPage: follow Parent > Kids > COSObject[n] > Parent and check whether the Parent value is the same for every element in the array. In this example, Parent is COSName {Parent} : 1781256139; for all pages.
COSName {Parent} : COSObject {
COSDictionary {
COSName {Kids} : COSArray {
COSObject {
COSDictionary {
COSName {TrimBox} : COSArray {0; 0; 612; 792;};
COSName {MediaBox} : COSArray {0; 0; 612; 792;};
COSName {CropBox} : COSArray {0; 0; 612; 792;};
COSName {Resources} : COSDictionary {
...
};
COSName {Contents} : COSObject {
...
};
COSName {Parent} : 1781256139;
COSName {StructParents} : COSInt {68};
COSName {ArtBox} : COSArray {0; 0; 612; 792; };
COSName {BleedBox} : COSArray {0; 0; 612; 792; };
COSName {Type} : COSName {Page};
}
}
...
COSName {Count} : COSInt {4};
COSName {Type} : COSName {Pages};
}
};
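Since the merged document keeps no link back to its sources, one possible workaround is to record the page count of every source document before merging and derive the origin from the page position afterwards. The following is only a sketch using plain JDK types: the class and method names are hypothetical, the page counts would in practice come from PDDocument.getNumberOfPages(), and it assumes the documents are appended in list order and the bundles are produced in order with a fixed bundle size.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox: maps a page position back to its source document.
final class PageOriginLookup {

    // Returns the 0-based index of the source document that contributed the given
    // 0-based page of the merged document. pageCounts holds the page count of each
    // source document, in merge order.
    static int sourceIndexOf(List<Integer> pageCounts, int mergedPageIndex) {
        if (mergedPageIndex < 0) {
            throw new IndexOutOfBoundsException("negative page index: " + mergedPageIndex);
        }
        int firstPageOfNextDocument = 0;
        for (int i = 0; i < pageCounts.size(); i++) {
            firstPageOfNextDocument += pageCounts.get(i);
            if (mergedPageIndex < firstPageOfNextDocument) {
                return i;
            }
        }
        throw new IndexOutOfBoundsException("page index beyond merged document: " + mergedPageIndex);
    }

    // Converts a page position inside a bundle to its position in the merged document,
    // assuming every bundle except possibly the last holds exactly bundleSize pages.
    static int mergedIndex(int bundleIndex, int pageInBundle, int bundleSize) {
        return bundleIndex * bundleSize + pageInBundle;
    }

    public static void main(String[] args) {
        // Assumed example data: three source documents with 3, 2 and 4 pages.
        List<Integer> pageCounts = Arrays.asList(3, 2, 4);
        int bundleSize = 4;
        int merged = mergedIndex(1, 0, bundleSize); // first page of the second bundle
        System.out.println("merged page index: " + merged);
        System.out.println("source document index: " + sourceIndexOf(pageCounts, merged));
    }
}
```

With the assumed data, the first page of the second bundle is merged page 4, which falls inside the second source document (index 1).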
Source code
I wrote the following code to show how the information from the PDPage dictionary can be used to navigate the pages back and forward and get the page number using the position in the array.
import java.io.File;
import java.io.IOException;
import java.util.LinkedList;
import java.util.List;
import java.util.ListIterator;
import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSObject;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException;

public class PDPageUtils {

    public static void main(String[] args) throws InvalidPasswordException, IOException {
        System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
        PDDocument document = null;
        try {
            String filename = "src/main/resources/pdf/us-017.pdf";
            document = PDDocument.load(new File(filename));
            System.out.println("listIterator(PDPage)");
            ListIterator<PDPage> pageIterator = listIterator(document.getPage(0));
            while (pageIterator.hasNext()) {
                PDPage page = pageIterator.next();
                // after next(), previousIndex() is the index of the page just returned
                System.out.println("page #: " + pageIterator.previousIndex() + ", Structural Parent Key: " + page.getStructParents());
            }
        } finally {
            if (document != null) {
                document.close();
            }
        }
    }

    /**
     * Returns a <code>ListIterator</code> initialized with the list of pages from
     * the dictionary embedded in the specified <code>PDPage</code>. The current
     * position of this <code>ListIterator</code> is set to the position of the
     * specified <code>PDPage</code>.
     *
     * @param page the specified <code>PDPage</code>
     *
     * @see java.util.ListIterator
     * @see org.apache.pdfbox.pdmodel.PDPage
     */
    public static ListIterator<PDPage> listIterator(PDPage page) {
        List<PDPage> pages = new LinkedList<PDPage>();
        COSDictionary pageDictionary = page.getCOSObject();
        // Parent is the page tree node; its Kids array lists the sibling pages in order
        COSDictionary parentDictionary = pageDictionary.getCOSDictionary(COSName.PARENT);
        COSArray kidsArray = parentDictionary.getCOSArray(COSName.KIDS);
        List<? extends COSBase> kidList = kidsArray.toList();
        for (COSBase kid : kidList) {
            if (kid instanceof COSObject) {
                COSObject kidObject = (COSObject) kid;
                COSBase type = kidObject.getDictionaryObject(COSName.TYPE);
                // only leaf pages are collected; nested Pages nodes are skipped
                if (COSName.PAGE.equals(type)) {
                    COSBase kidPageBase = kidObject.getObject();
                    if (kidPageBase instanceof COSDictionary) {
                        COSDictionary kidPageDictionary = (COSDictionary) kidPageBase;
                        pages.add(new PDPage(kidPageDictionary));
                    }
                }
            }
        }
        int index = pages.indexOf(page);
        return pages.listIterator(index);
    }
}
Sample output
In this example the PDF document has 4 pages and the iterator was initialized with the first page. Notice that the page number is the previousIndex()
System.out.println("listIterator(PDPage)");
ListIterator<PDPage> pageIterator = listIterator(document.getPage(0));
while (pageIterator.hasNext()) {
PDPage page = pageIterator.next();
System.out.println("page #: " + pageIterator.previousIndex() + ", Structural Parent Key: " + page.getStructParents());
}
listIterator(PDPage)
page #: 0, Structural Parent Key: 68
page #: 1, Structural Parent Key: 69
page #: 2, Structural Parent Key: 70
page #: 3, Structural Parent Key: 71
You can also navigate backwards by starting from the last page. Notice now that the page number is the nextIndex().
ListIterator<PDPage> pageIterator = listIterator(document.getPage(3));
pageIterator.next();
while (pageIterator.hasPrevious()) {
PDPage page = pageIterator.previous();
System.out.println("page #: " + pageIterator.nextIndex() + ", Structural Parent Key: " + page.getStructParents());
}
listIterator(PDPage)
page #: 3, Structural Parent Key: 71
page #: 2, Structural Parent Key: 70
page #: 1, Structural Parent Key: 69
page #: 0, Structural Parent Key: 68
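The choice between previousIndex() and nextIndex() in the two loops above follows from how java.util.ListIterator keeps its cursor between elements. The following sketch reproduces the same index bookkeeping with a plain list of strings instead of PDPage objects; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.ListIterator;

// Illustration only: shows which index accessor matches the element just returned.
final class IteratorIndexDemo {

    // Walks forward from 'start' and records previousIndex() after each next().
    static List<Integer> forwardIndexes(List<String> items, int start) {
        List<Integer> seen = new ArrayList<>();
        ListIterator<String> it = items.listIterator(start);
        while (it.hasNext()) {
            it.next();
            seen.add(it.previousIndex()); // index of the element just returned by next()
        }
        return seen;
    }

    // Walks backward from 'start' and records nextIndex() after each previous().
    static List<Integer> backwardIndexes(List<String> items, int start) {
        List<Integer> seen = new ArrayList<>();
        ListIterator<String> it = items.listIterator(start);
        it.next(); // step past the start element so that previous() returns it first
        while (it.hasPrevious()) {
            it.previous();
            seen.add(it.nextIndex()); // index of the element just returned by previous()
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList("p0", "p1", "p2", "p3");
        System.out.println(forwardIndexes(pages, 0));  // prints [0, 1, 2, 3]
        System.out.println(backwardIndexes(pages, 3)); // prints [3, 2, 1, 0]
    }
}
```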

Related

itext 7 pdf how to prevent text overflow on right side of the page

I am using itextpdf 7 (7.2.0) to create a pdf file. However even though the TOC part is rendered very well, in the content part the text overflows. Here is my code that generates the pdf:
public class Main {
public static void main(String[] args) throws IOException {
PdfWriter writer = new PdfWriter("fiftyfourthPdf.pdf");
PdfDocument pdf = new PdfDocument(writer);
Document document = new Document(pdf, PageSize.A4,false);
//document.setMargins(30,10,36,10);
// Create a PdfFont
PdfFont font = PdfFontFactory.createFont(StandardFonts.TIMES_ROMAN,"Cp1254");
document
.setTextAlignment(TextAlignment.JUSTIFIED)
.setFont(font)
.setFontSize(11);
PdfOutline outline = null;
java.util.List<AbstractMap.SimpleEntry<String, AbstractMap.SimpleEntry<String, Integer>>> toc = new ArrayList<>();
for(int i=0;i<5000;i++){
String line = "This is paragraph " + String.valueOf(i+1)+ " ";
line = line.concat(line).concat(line).concat(line).concat(line).concat(line);
Paragraph p = new Paragraph(line);
p.setKeepTogether(true);
document.add(p.setFont(font).setFontSize(10).setHorizontalAlignment(HorizontalAlignment.CENTER).setTextAlignment(TextAlignment.LEFT));
//PROCESS FOR TOC
String name = "para " + String.valueOf(i+1);
outline = createOutline(outline,pdf,line ,name );
AbstractMap.SimpleEntry<String, Integer> titlePage = new AbstractMap.SimpleEntry(line, pdf.getNumberOfPages());
p
.setFont(font)
.setFontSize(12)
//.setKeepWithNext(true)
.setDestination(name)
// Add the current page number to the table of contents list
.setNextRenderer(new UpdatePageRenderer(p));
toc.add(new AbstractMap.SimpleEntry(name, titlePage));
}
int contentPageNumber = pdf.getNumberOfPages();
for (int i = 1; i <= contentPageNumber; i++) {
// Write aligned text to the specified by parameters point
document.showTextAligned(new Paragraph(String.format("Sayfa %s / %s", i, contentPageNumber)).setFontSize(10),
559, 26, i, TextAlignment.RIGHT, VerticalAlignment.MIDDLE, 0);
}
//BEGINNING OF TOC
document.add(new AreaBreak());
Paragraph p = new Paragraph("Table of Contents")
.setFont(font)
.setDestination("toc");
document.add(p);
java.util.List<TabStop> tabStops = new ArrayList<>();
tabStops.add(new TabStop(580, TabAlignment.RIGHT, new DottedLine()));
for (AbstractMap.SimpleEntry<String, AbstractMap.SimpleEntry<String, Integer>> entry : toc) {
AbstractMap.SimpleEntry<String, Integer> text = entry.getValue();
p = new Paragraph()
.addTabStops(tabStops)
.add(text.getKey())
.add(new Tab())
.add(String.valueOf(text.getValue()))
.setAction(PdfAction.createGoTo(entry.getKey()));
document.add(p);
}
// Move the table of contents to the first page
int tocPageNumber = pdf.getNumberOfPages();
for (int i = 1; i <= tocPageNumber; i++) {
// Write aligned text to the specified by parameters point
document.showTextAligned(new Paragraph("\n footer text\n second line\nthird line").setFontColor(ColorConstants.RED).setFontSize(8),
300, 26, i, TextAlignment.CENTER, VerticalAlignment.MIDDLE, 0);
}
document.flush();
for(int z = 0; z< (tocPageNumber - contentPageNumber ); z++){
pdf.movePage(tocPageNumber,1);
pdf.getPage(1).setPageLabel(PageLabelNumberingStyle.UPPERCASE_LETTERS,
null, 1);
}
//pdf.movePage(tocPageNumber, 1);
// Add page labels
/*pdf.getPage(1).setPageLabel(PageLabelNumberingStyle.UPPERCASE_LETTERS,
null, 1);*/
pdf.getPage(tocPageNumber - contentPageNumber + 1).setPageLabel(PageLabelNumberingStyle.DECIMAL_ARABIC_NUMERALS,
null, 1);
document.close();
}
private static PdfOutline createOutline(PdfOutline outline, PdfDocument pdf, String title, String name) {
if (outline == null) {
outline = pdf.getOutlines(false);
outline = outline.addOutline(title);
outline.addDestination(PdfDestination.makeDestination(new PdfString(name)));
} else {
PdfOutline kid = outline.addOutline(title);
kid.addDestination(PdfDestination.makeDestination(new PdfString(name)));
}
return outline;
}
private static class UpdatePageRenderer extends ParagraphRenderer {
protected AbstractMap.SimpleEntry<String, Integer> entry;
public UpdatePageRenderer(Paragraph modelElement, AbstractMap.SimpleEntry<String, Integer> entry) {
super(modelElement);
this.entry = entry;
}
public UpdatePageRenderer(Paragraph modelElement) {
super(modelElement);
}
@Override
public LayoutResult layout(LayoutContext layoutContext) {
LayoutResult result = super.layout(layoutContext);
//entry.setValue(layoutContext.getArea().getPageNumber());
if (result.getStatus() != LayoutResult.FULL) {
if (null != result.getOverflowRenderer()) {
result.getOverflowRenderer().setProperty(
Property.LEADING,
result.getOverflowRenderer().getModelElement().getDefaultProperty(Property.LEADING));
} else {
// if overflow renderer is null, that could mean that the whole renderer will overflow
setProperty(
Property.LEADING,
getModelElement().getDefaultProperty(Property.LEADING));
}
}
return result;
}
@Override
// If not overriden, the default renderer will be used for the overflown part of the corresponding paragraph
public IRenderer getNextRenderer() {
return new UpdatePageRenderer((Paragraph) this.getModelElement());
}
}
}
Here are the screenshots of the TOC part and the content part: [screenshot: TOC] [screenshot: content]
What am I missing? Thank you all for your help.
UPDATE
When I add the line below, it renders with no overflow, but the page margins of the TOC and the content part differ (the TOC margin is much larger than the content margin):
document.setMargins(30,60,36,20);
[screenshot: right margin difference between TOC and content]
UPDATE 2 :
When I comment out the line
document.setMargins(30,60,36,20);
and change the font size in the line
document.add(p.setFont(font).setFontSize(10).setHorizontalAlignment(HorizontalAlignment.CENTER).setTextAlignment(TextAlignment.LEFT));
to 12, then it renders fine. Why would the font size affect the page content and margins? Aren't there standard page margins and page setups? Am I unknowingly (I am a newbie to itextpdf) breaking some standard behaviour?

TL;DR: either remove setFontSize in
p
.setFont(font)
.setFontSize(12)
//.setKeepWithNext(true)
.setDestination(name)
or change setFontSize(10) -> setFontSize(12) in
document.add(p.setFont(font).setFontSize(10).setHorizontalAlignment(HorizontalAlignment.CENTER).setTextAlignment(TextAlignment.LEFT));
Explanation: You are setting the Document to not immediately flush elements added to that document with the following line:
Document document = new Document(pdf, PageSize.A4,false);
Then you add a paragraph element with font size equal to 10 to the document with the following line:
document.add(p.setFont(font).setFontSize(10).setHorizontalAlignment(HorizontalAlignment.CENTER).setTextAlignment(TextAlignment.LEFT));
What happens is that the element is laid out (split into lines etc.) but not yet drawn on the page. Then you call .setFontSize(12), and this new font size is applied at draw time only. So iText calculated that X characters would fit into one line assuming a font size of 10, while in reality the font size is 12 and fewer characters fit into one line.
There is no sense in setting the font size twice to different values. Just pick the value you want to see in the resulting document and set it once.
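To make the mismatch concrete, here is a back-of-the-envelope calculation in plain Java. It is not iText code and does not reproduce iText's layout algorithm. It assumes an A4 page (595 pt wide) with 36 pt left and right margins, and it uses a made-up average glyph width of 0.5 em, whereas real fonts have per-glyph widths. The class and method names are hypothetical.

```java
// Rough illustration: lines measured at one font size but drawn at a larger one overflow.
final class LineFitDemo {

    // Assumption for illustration: every glyph is half the font size wide.
    static final double AVG_GLYPH_EM = 0.5;

    // Number of characters that fit into a line of the given width at the given font size.
    static int charsPerLine(double usableWidth, double fontSize) {
        return (int) Math.floor(usableWidth / (AVG_GLYPH_EM * fontSize));
    }

    // Width in points of a run of 'chars' characters drawn at the given font size.
    static double widthOf(int chars, double fontSize) {
        return chars * AVG_GLYPH_EM * fontSize;
    }

    public static void main(String[] args) {
        double usableWidth = 595 - 36 - 36;            // 523 pt between the margins
        int laidOut = charsPerLine(usableWidth, 10);   // line breaks computed at size 10
        int wouldFit = charsPerLine(usableWidth, 12);  // what actually fits at size 12
        double drawnWidth = widthOf(laidOut, 12);      // same line, drawn at size 12

        System.out.println("chars per line at 10 pt: " + laidOut);   // 104
        System.out.println("chars per line at 12 pt: " + wouldFit);  // 87
        System.out.println("drawn width at 12 pt: " + drawnWidth);   // 624.0
        System.out.println("overflows: " + (drawnWidth > usableWidth));
    }
}
```

Under these assumptions a line broken for size 10 is about 100 pt wider than the available area once it is drawn at size 12, which matches the overflow on the right side.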

Accessing a COSArray for PDF fields with Apache PDFBox

I'm trying to access all form fields in a PDF file - so I can use code to fill them in - and this is as far as I've gotten:
PDDocumentCatalog pdCatalog = pdf.getDocumentCatalog();
PDAcroForm pdAcroForm = pdCatalog.getAcroForm();
List<PDField> fieldList = pdAcroForm.getFields(); // fieldList.size() = 1
PDField field = fieldList.get(0);
COSDictionary dictionary = field.getCOSObject();
System.out.println("dictionary size = " + dictionary.size());
// my attempt to iterate through fields
for ( Map.Entry<COSName,COSBase> entry : dictionary.entrySet() )
{
COSName key = entry.getKey();
COSBase val = entry.getValue();
if ( val instanceof COSArray )
{
System.out.println("COSArray size = " + ((COSArray)val).size());
}
System.out.println("key = " + key);
System.out.println("val = " + val);
}
which gives an output of:
dictionary size = 3
COSArray size = 2
key = COSName{Kids}
val = COSArray{[COSObject{110, 0}, COSObject{108, 0}]}
key = COSName{T}
val = COSString{form1[0]}
key = COSName{V}
val = COSString{}
Does anyone know how I can access the two COSObjects in the COSArray? I also don't know what the notation COSObject{x, y} means, and can't find any documentation on this. If those are dictionaries or arrays themselves, I also want to know how to access their elements.

Call get(index) on the COSArray to get the COSObject (an indirect reference), or getObject(index) to get the dereferenced object that the COSObject points to.
The notation COSObject{110, 0} shows the object number and the generation number (usually 0). Open your PDF file in a text editor such as Notepad++ and search for "110 0 obj" to find the object, or "110 0 R" to see what references it.

Save custom page number labels in bookmarks

In the screenshot you can see custom page number labels (i, ii, iii, vii).
How can I save bookmarks with custom page number labels using PDFBox 2.0?
My code actually looks like this:
PDDocumentOutline documentOutline = new PDDocumentOutline();
document.getDocumentCatalog().setDocumentOutline(documentOutline);
PDOutlineItem outline = new PDOutlineItem();
outline.setTitle(toc.getName());
documentOutline.addLast(outline);
addToc(toc, outline);
outline.openNode();
documentOutline.openNode();
private void addToc(Toc toc, PDOutlineItem outlineItem) {
PDPageFitWidthDestination destination = new PDPageFitWidthDestination();
PDPage page = document.getPage(toc.getPageNumber() - 1);
destination.setPage(page);
PDOutlineItem bookmark = new PDOutlineItem();
bookmark.setDestination(destination);
bookmark.setTitle(toc.getName());
outlineItem.addLast(bookmark);
if (toc.getChildren() != null) {
for (Toc subToc : toc.getChildren()) {
addToc(subToc, bookmark);
}
}
}

You can only label pages, not bookmarks. In the example below (with 3 empty pages), the first range uses upper-case Roman numerals starting at 3 with the prefix "RO ", and the second range, beginning at the third page, uses decimal numbers starting at 1. So the pages are labelled "RO III", "RO IV", "1".
PDDocument doc = new PDDocument();
doc.addPage(new PDPage());
doc.addPage(new PDPage());
doc.addPage(new PDPage());
PDPageLabels pageLabels = new PDPageLabels(doc);
PDPageLabelRange pageLabelRange1 = new PDPageLabelRange();
pageLabelRange1.setPrefix("RO ");
pageLabelRange1.setStart(3);
pageLabelRange1.setStyle(PDPageLabelRange.STYLE_ROMAN_UPPER);
pageLabels.setLabelItem(0, pageLabelRange1);
PDPageLabelRange pageLabelRange2 = new PDPageLabelRange();
pageLabelRange2.setStart(1);
pageLabelRange2.setStyle(PDPageLabelRange.STYLE_DECIMAL);
pageLabels.setLabelItem(2, pageLabelRange2);
doc.getDocumentCatalog().setPageLabels(pageLabels);
doc.save("labels.pdf");
doc.close();
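To see why those three pages end up as "RO III", "RO IV" and "1", it can help to model the range rules by hand. The sketch below uses only the JDK and does not call PDFBox. The Range class and all method names are hypothetical stand-ins, only the Roman and decimal styles are modelled, and it assumes that a range is registered for page index 0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hand-rolled model of page label ranges, for illustration only.
final class PageLabelModel {

    // Hypothetical stand-in for one label range: prefix, numbering style, first number.
    static final class Range {
        final String prefix;
        final boolean roman;
        final int start;

        Range(String prefix, boolean roman, int start) {
            this.prefix = prefix;
            this.roman = roman;
            this.start = start;
        }
    }

    // Converts a positive number to upper-case Roman numerals.
    static String toRoman(int n) {
        int[] values = {1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1};
        String[] symbols = {"M", "CM", "D", "CD", "C", "XC", "L", "XL", "X", "IX", "V", "IV", "I"};
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            while (n >= values[i]) {
                sb.append(symbols[i]);
                n -= values[i];
            }
        }
        return sb.toString();
    }

    // Each page uses the range with the greatest start page index not after it;
    // numbering continues from that range's start value.
    static List<String> labels(TreeMap<Integer, Range> ranges, int pageCount) {
        List<String> result = new ArrayList<>();
        for (int page = 0; page < pageCount; page++) {
            Map.Entry<Integer, Range> entry = ranges.floorEntry(page);
            Range range = entry.getValue();
            int number = range.start + (page - entry.getKey());
            result.add(range.prefix + (range.roman ? toRoman(number) : String.valueOf(number)));
        }
        return result;
    }

    // Mirrors the setup from the answer: Roman "RO " from page 0, decimal from page 2.
    static List<String> example() {
        TreeMap<Integer, Range> ranges = new TreeMap<>();
        ranges.put(0, new Range("RO ", true, 3));
        ranges.put(2, new Range("", false, 1));
        return labels(ranges, 3);
    }

    public static void main(String[] args) {
        System.out.println(example()); // prints [RO III, RO IV, 1]
    }
}
```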

Using pdfbox to get form field values

I'm using PDFBox for the first time, and I have been reading the documentation on the PDFBox website.
In summary, I have a PDF form like this: [screenshot]
except that my file has many different components (text fields, radio buttons, check boxes). For this PDF I have to read these values: Mauro, Rossi, MyCompany. So far I have written the following code:
PDDocument pdDoc = PDDocument.loadNonSeq( myFile, null );
PDDocumentCatalog pdCatalog = pdDoc.getDocumentCatalog();
PDAcroForm pdAcroForm = pdCatalog.getAcroForm();
for(PDField pdField : pdAcroForm.getFields()){
System.out.println(pdField.getValue());
}
Is this a correct way to read the value inside the form component?
Any suggestion about this?
Where can I learn other things on pdfbox?

The code you have should work. If you are actually looking to do something with the values, you'll likely need to use some other methods. For example, you can get specific fields using pdAcroForm.getField(<fieldName>):
PDField firstNameField = pdAcroForm.getField("firstName");
PDField lastNameField = pdAcroForm.getField("lastName");
Note that PDField is just a base class. You can cast things to sub classes to get more interesting information from them. For example:
PDCheckbox fullTimeSalary = (PDCheckbox) pdAcroForm.getField("fullTimeSalary");
if(fullTimeSalary.isChecked()) {
log.debug("The person earns a full-time salary");
} else {
log.debug("The person does not earn a full-time salary");
}
As you suggest, you'll find more information at the apache pdfbox website.

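A top-level field may be a non-terminal field, i.e. one that only groups child fields. In that case you need to descend through its children until you reach the terminal fields, which are the ones that hold values. The code snippet below walks all fields recursively and prints the fully qualified field names and values. Note that PDNonTerminalField and getValueAsString() belong to the PDFBox 2.0 API, where PDDocument.loadNonSeq no longer exists and PDDocument.load is used instead.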
{
//from your original code
PDDocument pdDoc = PDDocument.loadNonSeq( myFile, null );
PDDocumentCatalog pdCatalog = pdDoc.getDocumentCatalog();
PDAcroForm pdAcroForm = pdCatalog.getAcroForm();
//get all fields in form
List<PDField> fields = pdAcroForm.getFields();
System.out.println(fields.size() + " top-level fields were found on the form");
//inspect field values
for (PDField field : fields)
{
processField(field, "|--", field.getPartialName());
}
...
}
private void processField(PDField field, String sLevel, String sParent) throws IOException
{
String partialName = field.getPartialName();
if (field instanceof PDNonTerminalField)
{
if (!sParent.equals(field.getPartialName()))
{
if (partialName != null)
{
sParent = sParent + "." + partialName;
}
}
System.out.println(sLevel + sParent);
for (PDField child : ((PDNonTerminalField)field).getChildren())
{
processField(child, "| " + sLevel, sParent);
}
}
else
{
//field has no child. output the value
String fieldValue = field.getValueAsString();
StringBuilder outputString = new StringBuilder(sLevel);
outputString.append(sParent);
if (partialName != null)
{
outputString.append(".").append(partialName);
}
outputString.append(" = ").append(fieldValue);
outputString.append(", type=").append(field.getClass().getName());
System.out.println(outputString);
}
}

Created documents are not versionable

I use the OpenCMIS in-memory repository for testing. But when I create a document, I am not allowed to set the versioning state to anything other than VersioningState.NONE.
The created document is somehow not versionable. I used the code from http://chemistry.apache.org/java/examples/example-create-update.html
The test method:
public void test() {
String filename = "test123";
Folder folder = this.session.getRootFolder();
// Create a doc
Map<String, Object> properties = new HashMap<String, Object>();
properties.put(PropertyIds.OBJECT_TYPE_ID, "cmis:document");
properties.put(PropertyIds.NAME, filename);
String docText = "This is a sample document";
byte[] content = docText.getBytes();
InputStream stream = new ByteArrayInputStream(content);
ContentStream contentStream = this.session.getObjectFactory().createContentStream(filename, Long.valueOf(content.length), "text/plain", stream);
Document doc = folder.createDocument(
properties,
contentStream,
VersioningState.MAJOR);
}
The exception I get:
org.apache.chemistry.opencmis.commons.exceptions.CmisConstraintException: The versioning state flag is imcompatible to the type definition.
What am I missing?

I found the reason...
By executing the following code I discovered that the OBJECT_TYPE_ID 'cmis:document' doesn't allow versioning.
Groovy code to list all available OBJECT_TYPE_IDs:
boolean includePropertyDefintions = true;
for (t in session.getTypeDescendants(
null, // start at the top of the tree
-1, // infinite depth recursion
includePropertyDefintions // include prop defs
)) {
printTypes(t, "");
}
static void printTypes(Tree tree, String tab) {
ObjectType objType = tree.getItem();
println(tab + "TYPE:" + objType.getDisplayName() +
" (" + objType.getDescription() + ")");
// Print some of the common attributes for this type
print(tab + " Id:" + objType.getId());
print(" Fileable:" + objType.isFileable());
print(" Queryable:" + objType.isQueryable());
if (objType instanceof DocumentType) {
print(" [DOC Attrs->] Versionable:" +
((DocumentType)objType).isVersionable());
print(" Content:" +
((DocumentType)objType).getContentStreamAllowed());
}
println(""); // end the line
for (t in tree.getChildren()) {
// there are more - call self for next level
printTypes(t, tab + " ");
}
}
This resulted in a list like this:
TYPE:CMIS Folder (Description of CMIS Folder Type) Id:cmis:folder
Fileable:true Queryable:true
TYPE:CMIS Document (Description of CMIS Document Type)
Id:cmis:document Fileable:true Queryable:true [DOC Attrs->]
Versionable:false Content:ALLOWED
TYPE:My Type 1 Level 1 (Description of My Type 1 Level 1 Type)
Id:MyDocType1 Fileable:true Queryable:true [DOC Attrs->]
Versionable:false Content:ALLOWED
TYPE:VersionedType (Description of VersionedType Type)
Id:VersionableType Fileable:true Queryable:true [DOC Attrs->]
Versionable:true Content:ALLOWED
As you can see the last OBJECT_TYPE_ID has versionable: true... and when I use that it does work.
