Java PDFBox Multi-Page Document With Headings and Table of Contents


PDFBox 2.0 is still evolving and does not seem to have simple examples to get started with.
I need to create a multi-page PDF dynamically from an object, with a table of contents and headings.
How do I create numbered headings? (Simply increasing the font size is not an option, because the table of contents has to know each heading's location [page number] in the document.)
Example:
Table of Contents
Document Title Here
1- Intro......................................................1
2- Heading....................................................2
2.1- SubHeading1 ........................................2
2.2- SubHeading2 ........................................5
Page 1
Document Title Here
1- Intro
This is an intro to the document....
2- Heading
2.1- Subheading
Some text here...
........
Page 2
I have two problems:
1- I followed this example: "PDFBox - how to create table of contents", but it didn't create any table of contents.
I'm getting this exception:
java.lang.IllegalArgumentException: Destination of a GoTo action must be a page dictionary object
at org.apache.pdfbox.pdmodel.interactive.action.PDActionGoTo.setDestination(PDActionGoTo.java:90)
I removed the PDActionGoTo as the comment on that answer suggested; the exception went away, but still no table of contents was created.
2- I have no idea how to make headings!
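One possible way to approach the layout described above is in two passes: lay out the body first while recording the page each heading lands on, then render the contents page from those records. The sketch below covers only that bookkeeping (numbering and dot-leader lines) in plain Java and makes no PDFBox calls. The names `TocSketch`, `Heading`, `numberHeadings` and `tocLine` are hypothetical, and padding by character count is an assumption that only holds for a monospaced font; with a proportional font the leader width would have to come from font metrics.

```java
import java.util.ArrayList;
import java.util.List;

class TocSketch {

    // Hypothetical record of one heading: nesting level (1 = top level),
    // title, and the page it ended up on after the body was laid out.
    static final class Heading {
        final int level;
        final String title;
        final int page;

        Heading(int level, String title, int page) {
            this.level = level;
            this.title = title;
            this.page = page;
        }
    }

    // Derives "1", "2", "2.1", "2.2" style numbers from the heading levels,
    // taken in document order.
    static List<String> numberHeadings(List<Heading> headings) {
        int[] counters = new int[9]; // assumption: at most nine nesting levels
        List<String> numbers = new ArrayList<>();
        for (Heading h : headings) {
            counters[h.level - 1]++;
            // A new heading restarts the numbering of everything nested below it.
            for (int i = h.level; i < counters.length; i++) {
                counters[i] = 0;
            }
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < h.level; i++) {
                if (i > 0) {
                    sb.append('.');
                }
                sb.append(counters[i]);
            }
            numbers.add(sb.toString());
        }
        return numbers;
    }

    // Formats one contents line, padded with dot leaders to a fixed
    // character width (always at least one dot).
    static String tocLine(String number, Heading h, int width) {
        String left = number + "- " + h.title;
        String right = String.valueOf(h.page);
        int dots = Math.max(1, width - left.length() - right.length());
        StringBuilder line = new StringBuilder(left);
        for (int i = 0; i < dots; i++) {
            line.append('.');
        }
        return line.append(right).toString();
    }

    public static void main(String[] args) {
        List<Heading> headings = new ArrayList<>();
        headings.add(new Heading(1, "Intro", 1));
        headings.add(new Heading(1, "Heading", 2));
        headings.add(new Heading(2, "SubHeading1", 2));
        headings.add(new Heading(2, "SubHeading2", 5));
        List<String> numbers = numberHeadings(headings);
        for (int i = 0; i < headings.size(); i++) {
            System.out.println(tocLine(numbers.get(i), headings.get(i), 60));
        }
    }
}
```

The same numbers can be reused when drawing the headings in the body, so the contents page and the body stay consistent.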

Related

Apache PDFBox 2.0.18 - Comments/Annotations status linking

I need to merge comments taken from many versions of the same pdf file but with different comments, into one PDF file containing all comments.
I take all the comments from the pages and create an arrayList of them, then I simply set this array of comments on the new pdf file and it works pretty well.
The problem is that I also need to create an Excel file listing all the comments found, together with their "status" (accepted, cancelled, rejected, etc.).
PDFBox seems to expose the status as a separate annotation/comment, and I can't find any relation between a comment and its status.
Example:
I have a PDAnnotation object with content "COMMENT 1".
And I have another PDAnnotation object with content "Accepted by user XX" (the status of COMMENT 1).
On Acrobat Reader I see the comment "COMMENT 1" with the status set on "Accepted", so there must be a relation between the two objects, but I can't find it.
Any ideas?

Using the PDFDebugger is a good suggestion; it should give you an overview of how objects (including PDAnnotations) are linked to each other.
In any case, check whether the COSDictionary of your child PDAnnotation has an IRT ("in reply to") key; its value should be the parent annotation's COSObject.
So if you do something like this:
COSDictionary parentDict = (COSDictionary) childDict.getDictionaryObject("IRT");
you should get the parent PDAnnotation's dictionary, from which you can take all the data you need.
Note that the cast is necessary because getDictionaryObject returns a COSBase, while the object stored under the IRT key is actually a COSDictionary.
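Once each status annotation has been resolved to its parent through IRT, building the Excel rows comes down to grouping replies under the comment they point at. The following is a minimal sketch of that grouping step only, written with plain Java stand-ins instead of PDFBox types: `Note`, `inReplyTo` and `repliesByComment` are hypothetical names, and `inReplyTo` is assumed to hold whatever parent the IRT lookup returned.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ReplyLinkSketch {

    // Hypothetical stand-in for an annotation: its text and the annotation
    // it replies to (null for an ordinary, top-level comment).
    static final class Note {
        final String contents;
        final Note inReplyTo;

        Note(String contents, Note inReplyTo) {
            this.contents = contents;
            this.inReplyTo = inReplyTo;
        }
    }

    // Maps each parent note to the contents of the notes that reply to it.
    // Top-level comments without replies map to an empty list.
    static Map<Note, List<String>> repliesByComment(List<Note> notes) {
        Map<Note, List<String>> rows = new LinkedHashMap<>();
        for (Note note : notes) {
            if (note.inReplyTo == null) {
                rows.put(note, new ArrayList<>());
            }
        }
        for (Note note : notes) {
            if (note.inReplyTo != null) {
                rows.computeIfAbsent(note.inReplyTo, key -> new ArrayList<>())
                    .add(note.contents);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        Note comment = new Note("COMMENT 1", null);
        Note status = new Note("Accepted by user XX", comment);
        List<Note> notes = new ArrayList<>();
        notes.add(status); // order on the page is not assumed
        notes.add(comment);
        for (Map.Entry<Note, List<String>> row : repliesByComment(notes).entrySet()) {
            System.out.println(row.getKey().contents + " -> " + row.getValue());
        }
    }
}
```

Keying the map on the note object rather than on its text avoids merging two different comments that happen to have the same contents.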

MigraDOC and pdfSharp, center dynamic text

I am looking for a way to create a PDF document in which the position of a dynamic field is calculated for me:
The PDF content should look something like this
(all pdf content needs to be centered)
Header
Field1 : not dynamic (needs to be centered)
Field2 : UserName (dynamic - needs to be placed in the center of the paragraph), since each user has a different name length
So my question is: does PDFsharp or MigraDoc have a method that aligns text to the center (meaning that it does the calculation, taking the font family and font size into account, so that in the end the marked text is centered)?
If so, which method is it? I've searched the MigraDoc and PDFsharp documentation and could not find anything like that.
And if no such method exists, has anyone tried this or worked with it? Any suggestions on how I can achieve this behavior, or some source to look at?
Thank you

Sample 1 shows, with plenty of example code, how to create a PDF and use almost all the functionality that MigraDoc offers. Sample 2 shows in detail how to create tables, which might also be of interest for laying out the page's content.
The alignment (center/left/right) is usually done by setting the Format.Alignment property like this:
Paragraph par = new Paragraph();
par.Format.Alignment = ParagraphAlignment.Center;
A short version of a document with centered content would be:
// first you need a document
Document MigraDokument = new Document();
// each document needs at least one section
Section section = MigraDokument.AddSection();
section.PageSetup.PageFormat = PageFormat.A4;
// and then you add paragraphs to the section
Paragraph par = section.AddParagraph();
// and set the alignment as you wish
par.Format.Alignment = ParagraphAlignment.Center;
// now just fill it with content and set the rest of the parameters...
par.AddText("text");

How to create or update table of contents in a word file?

I can read and write Word documents in Java using Apache POI or docx4j, but I can't find any references on how to create or update a table of contents in a Word file. Is there any other Java API that supports TOCs? Or do Apache POI or docx4j have options to create or update a TOC?

To create a table of contents with Apache POI you can just use:
doc.createTOC();
But it seems a bit buggy. The TOC is created, but Word (MS Office Pro 2010) does not seem to recognize it as a TOC and the references do not work.
Or you can call:
doc.enforceUpdateFields();
This will create a popup in word document with: "This document contains fields that may refer to other files. Do you want to update the fields in this document?", which looks a bit dodgy if you are opening a new doc :)

There's a cleaner way for this too.
You just need to open an empty docx which will act as a template. Add some sample text to it in the style that you want to include, and then this piece of code will work.
// template.docx already contains text formatted with the Heading1 style
XWPFDocument document = new XWPFDocument(new FileInputStream("template.docx"));
XWPFParagraph paragraph = document.createParagraph();
paragraph.setStyle("Heading1");

IO Issue - Byte Array Image into XHTML(FlyingSaucer)

I have a solution that inserts strings into an XHTML document and prints the results as Reports. My employer has asked if we could pull images off their SQL database (stored as byte arrays) to insert into the Reports.
I am using FlyingSaucer as the XHTML interpreter and I've been using Java DOM to modify pre-stored reports that I have stored in the Report Generator's package.
The only solution I can think of at the moment is to construct the images, save them as a file, link the file in an img tag (or background-image) in a constructed report, print the report and then delete the file. This seems really sloppy and I imagine it will be very time consuming.
I can't help but feel there must be a more elegant solution. Any suggestions for inserting a byte array into html?

Read the image and convert it into its Base64-encoded form:
InputStream image = getClass().getClassLoader().getResourceAsStream("image.png");
String encodedImage = BaseEncoding.base64().encode(ByteStreams.toByteArray(image));
I've used BaseEncoding and ByteStreams from Google Guava.
Change src attribute of img element within your Document object.
Document doc = ...; // get Document from XHTMLPanel.getDocument() or create
// new one using DocumentBuilderFactory
doc.getElementById("myImage").getAttributes().getNamedItem("src").setNodeValue("data:image/png;base64," + encodedImage);
Unfortunately, FlyingSaucer does not support data URIs out of the box, so you'll have to create your own ReplacedElementFactory. Read the article "Using Data URLs for embedding images in Flying Saucer generated PDFs"; it contains a complete solution.
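For images that arrive as byte arrays from the database, the same idea can be sketched without Guava by swapping in `java.util.Base64` from the JDK (available since Java 8). This is a minimal sketch, not the complete FlyingSaucer solution: the class and method names are hypothetical, the image is located by tag name rather than by id as an assumption to keep the example self-contained, and the custom ReplacedElementFactory mentioned above is still required for rendering.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DataUriSketch {

    // Builds a data URI from raw image bytes; the MIME type is supplied by the caller.
    static String toDataUri(byte[] imageBytes, String mimeType) {
        return "data:" + mimeType + ";base64,"
                + Base64.getEncoder().encodeToString(imageBytes);
    }

    // Parses an XHTML string into a DOM document (stand-in for the stored report).
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }

    // Sets the src attribute of the first <img> element in the document.
    static void setFirstImageSource(Document doc, String dataUri) {
        Element img = (Element) doc.getElementsByTagName("img").item(0);
        img.setAttribute("src", dataUri);
    }

    // Reads the src attribute of the first <img> element back.
    static String firstImageSource(Document doc) {
        return ((Element) doc.getElementsByTagName("img").item(0)).getAttribute("src");
    }

    public static void main(String[] args) {
        byte[] fromDatabase = {1, 2, 3}; // placeholder for the bytes read from SQL
        Document doc = parse("<html><body><img src=\"\"/></body></html>");
        setFirstImageSource(doc, toDataUri(fromDatabase, "image/png"));
        System.out.println(firstImageSource(doc)); // prints "data:image/png;base64,AQID"
    }
}
```

No temporary file is written at any point; the bytes go straight from memory into the document.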

Accessing "alternate text" for an image via PDFBox

Is there some way to extract "alternate text" for a specific image using PDFBox?
I have a PDF file which, as described at http://www.w3.org/WAI/GL/2011/WD-WCAG20-TECHS-20110621/pdf.html#PDF1, has had alternate text added to an image. Using PDFBox I can find my way through the object model to the image itself (a PDXObjectImage) through PDDocument.getDocumentCatalog().getAllPages() [iterator] .getResources().getImages(), but I cannot see any way to get from the image itself to its alternate text.
A small sample PDF (with a single image which has some alternate text specified) can be found at http://dl.dropbox.com/u/12253279/image_test_pass.pdf (It should say "This is the alternate text for the image.").

I do not know how, or whether, this can be done with PDFBox, but I can tell you that this feature is related to the sections of the PDF Spec called Logical Structure/Tagged PDF, which are not fully supported by every PDF tool out there.
Assuming it is supported by the tool you are using, you will have to follow 4 main steps to retrieve this information (I will use the sample PDF file you posted for the following explanation).
Assuming you have access to the internal structure of the PDF file, you will need to:
1- Parse the page content and find the MCID number of the Tag element that wraps the image you are interested in.
Page content:
BT
/P <</MCID 0 >>BDC
/GS0 gs
/TT0 1 Tf
0.0004 Tc -0.0028 Tw 10.02 0 0 10.02 90 711 Tm
(This is an image test )Tj
EMC
ET
/Figure <</MCID 1 >>BDC
q
106.5 0 0 106.5 90 591.0599976 cm
/Im0 Do
Q
EMC
Your image:
2- In the page object, retrieve the key StructParents.
3- Now retrieve the Structure Tree (key StructTreeRoot of the Catalog object, which is the root object in every PDF file), and inside it, the ParentTree.
4- The ParentTree starts with an array where you can find pairs of elements (see Number Trees in the PDF Spec for more details). In this specific tree, the first element of each pair is a numeric value that corresponds to the StructParents key retrieved in step 2, and the second element is an array of objects whose indexes correspond to the MCID values retrieved in step 1. So you search here for the element that corresponds to the MCID value of your image, and you will find a PDF object. Inside this object, you will find the alternate text.
Looks easy, doesn't it?
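The lookup in steps 2 to 4 can be sketched in plain Java as follows. This is only a model of the data flow, not a PDF parser: `StructElem`, `altTextFor` and the map-of-lists representation of the ParentTree are hypothetical stand-ins, and the StructParents value of 0 used in the example is an assumption rather than something read from the sample file.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ParentTreeLookupSketch {

    // Hypothetical stand-in for a structure element; only the structure
    // type and the /Alt entry are modeled.
    static final class StructElem {
        final String type;
        final String alt; // null when the element has no alternate text

        StructElem(String type, String alt) {
            this.type = type;
            this.alt = alt;
        }
    }

    // parentTree maps a page's StructParents value to an array of structure
    // elements whose indexes are the MCID values used in that page's content.
    static String altTextFor(Map<Integer, List<StructElem>> parentTree,
                             int structParents, int mcid) {
        List<StructElem> byMcid = parentTree.get(structParents);
        if (byMcid == null || mcid < 0 || mcid >= byMcid.size()) {
            return null; // no entry for this page or this marked-content id
        }
        StructElem element = byMcid.get(mcid);
        return element == null ? null : element.alt;
    }

    public static void main(String[] args) {
        Map<Integer, List<StructElem>> parentTree = new HashMap<>();
        // MCID 0 is the /P paragraph, MCID 1 is the /Figure wrapping /Im0.
        parentTree.put(0, Arrays.asList(
                new StructElem("P", null),
                new StructElem("Figure", "This is the alternate text for the image.")));
        System.out.println(altTextFor(parentTree, 0, 1));
    }
}
```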
Tools used in this answer:
PDF Vole (based on iText)
Amyuni PDF Analyzer

Eric from the PDFBox mailing list sent me the following, though I've not tested it out yet...
Hi,
For your test file, here is a way to access "/Alt" entry :
PDDocument document = PDDocument.load("image_test_pass.pdf");
PDStructureTreeRoot treeRoot =
        document.getDocumentCatalog().getStructureTreeRoot();
// get page for each StructElement
for (Object o : treeRoot.getKids()) {
    if (o instanceof PDStructureElement) {
        PDStructureElement structElement = (PDStructureElement)o;
        System.out.println(structElement.getAlternateDescription());
        PDPage page = structElement.getPage();
        if (page != null) {
            page.getResources().getImages();
        }
    }
}
Please refer to the PDF specification http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf, in particular §14.6, §14.7, §14.9.3 and §14.9.4, for all the rules on how to find the "/Alt" entry. There seem to be several ways to define this information.
BR,
Eric
