I am getting the exception below while trying to redact a PDF document using iText. The issue is sporadic: sometimes it works and sometimes it throws this error.
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.access$6100(PdfContentStreamProcessor.java:60)
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor$Do.invoke(PdfContentStreamProcessor.java:991)
at com.itextpdf.text.pdf.pdfcleanup.PdfCleanUpContentOperator.invoke(PdfCleanUpContentOperator.java:140)
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.invokeOperator(PdfContentStreamProcessor.java:286)
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.processContent(PdfContentStreamProcessor.java:425)
at com.itextpdf.text.pdf.pdfcleanup.PdfCleanUpProcessor.cleanUpPage(PdfCleanUpProcessor.java:160)
at com.itextpdf.text.pdf.pdfcleanup.PdfCleanUpProcessor.cleanUp(PdfCleanUpProcessor.java:135)
at RedactionClass.tgestRedactJavishsInput(RedactionClass.java:56)
at RedactionClass.main(RedactionClass.java:23)
The code I am using to redact is below:
public static void testRedact() throws IOException, DocumentException {
    InputStream resource = new FileInputStream("D:/itext/edited_120192824_5 (1).pdf");
    OutputStream result = new FileOutputStream(new File(OUTPUTDIR, "aviteshs.pdf"));
    PdfReader reader = new PdfReader(resource);
    PdfStamper stamper = new PdfStamper(reader, result);
    int pageCount = reader.getNumberOfPages();
    Rectangle linkLocation1 = new Rectangle(440f, 700f, 470f, 710f);
    Rectangle linkLocation2 = new Rectangle(308f, 205f, 338f, 215f);
    Rectangle linkLocation3 = new Rectangle(90f, 155f, 130f, 165f);
    List<PdfCleanUpLocation> cleanUpLocations = new ArrayList<PdfCleanUpLocation>();
    for (int currentPage = 1; currentPage <= pageCount; currentPage++) {
        if (currentPage == 1) {
            cleanUpLocations.add(new PdfCleanUpLocation(currentPage, linkLocation1, BaseColor.BLACK));
            cleanUpLocations.add(new PdfCleanUpLocation(currentPage, linkLocation2, BaseColor.BLACK));
            cleanUpLocations.add(new PdfCleanUpLocation(currentPage, linkLocation3, BaseColor.BLACK));
        } else {
            cleanUpLocations.add(new PdfCleanUpLocation(currentPage, linkLocation1, BaseColor.BLACK));
        }
    }
    PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(cleanUpLocations, stamper);
    try {
        cleaner.cleanUp();
    } catch (Exception e) {
        e.printStackTrace();
    }
    stamper.close();
    reader.close();
}
Since this is a customer document I am unable to share it; I am trying to find some test data that reproduces the issue.
Please find the doc here:
https://drive.google.com/file/d/0B-zalNTEeIOwM1JJVWctcW8ydU0/view?usp=drivesdk
In short: the cause of the NullPointerException here is that iText does not support form XObject resource inheritance from the page on which the XObject is displayed. According to the PDF specification this construct is obsolete, but it can be encountered in PDFs that follow early PDF references instead of the specification.
The cause
Page 1 of the document in question contains 4 XObject resources named I1, M0, P1, and Q0:
As you can see in the screenshot, Q0 in particular has no Resources dictionary of its own. But its last instructions are
q
413 0 0 125 75 3086 cm
/I1 Do
Q
That is, it references a resource I1.
In the case of form XObjects, however, iText assumes that the resources their content streams reference are contained in the XObject's own Resources dictionary.
The result: iText accesses a null dictionary and a NullPointerException occurs.
The specification
The PDF specification ISO 32000-1 specifies:
A resource dictionary shall be associated with a content stream in one of the following ways:
For a content stream that is the value of a page’s Contents entry (or is an element of an array that is the value of that entry), the resource dictionary shall be designated by the page dictionary’s Resources or is inherited, as described under 7.7.3.4, "Inheritance of Page Attributes," from some ancestor node of the page object.
For other content streams, a conforming writer shall include a Resources entry in the stream's dictionary specifying the resource dictionary which contains all the resources used by that content stream. This shall apply to content streams that define form XObjects, patterns, Type 3 fonts, and annotations.
PDF files written obeying earlier versions of PDF may have omitted the Resources entry in all form XObjects and Type 3 fonts used on a page. All resources that are referenced from those forms and fonts shall be inherited from the resource dictionary of the page on which they are used. This construct is obsolete and should not be used by conforming writers.
(ISO 32000-1, section 7.8.3 - Resource Dictionaries)
Thus, in the case at hand we are in the situation of the obsolete option three: Q0 references the XObject I1 defined in the resource dictionary of the page on which Q0 is used.
The document in question has a version header claiming PDF 1.5 conformance (in contrast to PDF 1.7 of the PDF specification). So let's look at the PDF Reference 1.5. The paragraph there corresponding to option three is:
A form XObject or a Type 3 font’s glyph description may omit the Resources
entry, in which case resources will be looked up in the Resources entry of the
page on which the form or font is used. This practice is not recommended.
Summarized, therefore, the PDF in question uses a construct which the PDF specification (published in 2008, in use for nine years!) calls obsolete, and which even the PDF Reference the file claims conformance to recommends against. iText, on the other hand, does not support this obsolete construct.
Ideas on how to fix this
Essentially the PDF Cleanup code must be extended to
remember the resources of the current page in the PdfCleanUpProcessor and
use these current page resources in the PdfCleanUpContentOperator method invoke in the case of a Do operator referring to a form XObject without its own resources.
Unfortunately some members used in invoke are private. Thus, one either has to copy the PdfCleanUp code or fall back on reflection.
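Alternatively, one can repair the document itself before cleanup by copying each page's Resources dictionary into every form XObject that lacks one, so that the obsolete inheritance construct is no longer needed. A minimal sketch, assuming iText 5.5.x (the method name inlinePageResources is mine, and this is untested against the customer file):

import com.itextpdf.text.pdf.*;

static void inlinePageResources(PdfReader reader) {
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        PdfDictionary page = reader.getPageN(i);
        PdfDictionary resources = page.getAsDict(PdfName.RESOURCES);
        if (resources == null)
            continue;
        PdfDictionary xobjects = resources.getAsDict(PdfName.XOBJECT);
        if (xobjects == null)
            continue;
        for (PdfName name : xobjects.getKeys()) {
            PdfStream xobject = xobjects.getAsStream(name);
            // only form XObjects without their own Resources dictionary need help
            if (xobject != null
                    && PdfName.FORM.equals(xobject.getAsName(PdfName.SUBTYPE))
                    && xobject.getAsDict(PdfName.RESOURCES) == null) {
                xobject.put(PdfName.RESOURCES, resources);
            }
        }
    }
}

Calling this on the PdfReader before constructing the PdfStamper should suffice, as the (non-append) stamper rewrites the reader's objects into the output.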
(iText 5.5.12-SNAPSHOT)
iText 7
The iText 7 PDF CleanUp tool also runs into an issue for your PDF; here the exception is an IllegalStateException claiming "Graphics state is always deleted after event dispatching. If you want to preserve it in renderer info, use preserveGraphicsState method after receiving renderer info."
As this exception is thrown during event dispatching, the error message does not make sense. Unfortunately the PDF CleanUp tool has become closed source in iText 7, so it is not easy to pinpoint the issue.
(iText 7.0.3-SNAPSHOT; PDF CleanUp 1.0.2-SNAPSHOT)
Related
I am trying to use Apache PDFBox v2.0.21 to modify existing PDF documents, adding signatures and annotations, which means that I am actively using incremental save mode. I am also embedding the LiberationSans font to accommodate some Unicode characters. It makes sense for me to use the subsetting feature of PDF embedded fonts, as embedding LiberationSans in full makes the PDF file around 200+ KB larger.
After much trial and error I finally managed to have something working - all but the font subsetting. I initialize the PDFont object once per document using
try (InputStream fs = PDFService.class.getResourceAsStream("/static/fonts/LiberationSans-Regular.ttf")) {
_font = PDType0Font.load(pddoc, fs, true);
}
And then I use a custom appearance stream to show the text:
private void addAnnotation(String name, PDDocument doc, PDPage page, float x, float y, String text) throws IOException {
    List<PDAnnotation> annotations = page.getAnnotations();
    PDAnnotationRubberStamp t = new PDAnnotationRubberStamp();
    t.setAnnotationName(name); // might play important role
    t.setPrinted(true);        // always visible
    t.setReadOnly(true);       // does not interact with user
    t.setContents(text);
    PDRectangle rect = ....;
    t.setRectangle(rect);
    PDAppearanceDictionary ap = new PDAppearanceDictionary();
    ap.setNormalAppearance(createAppearanceStream(doc, t));
    ap.getCOSObject().setNeedToBeUpdated(true);
    t.setAppearance(ap);
    annotations.add(t);
    page.setAnnotations(annotations);
    t.getCOSObject().setNeedToBeUpdated(true);
    page.getResources().getCOSObject().setNeedToBeUpdated(true);
    page.getCOSObject().setNeedToBeUpdated(true);
    doc.getDocumentCatalog().getPages().getCOSObject().setNeedToBeUpdated(true);
    doc.getDocumentCatalog().getCOSObject().setNeedToBeUpdated(true);
}
private PDAppearanceStream createAppearanceStream(final PDDocument document, PDAnnotation ann) throws IOException {
    PDAppearanceStream aps = new PDAppearanceStream(document);
    PDRectangle rect = ann.getRectangle();
    rect = new PDRectangle(0, 0, rect.getWidth(), rect.getHeight());
    aps.setBBox(rect); // set bounding box to the dimensions of the annotation itself
    // embed our unicode font (NB: yes, this needs to be done otherwise
    // aps.getResources() == null which will cause NPE later during setFont)
    PDResources res = new PDResources();
    _fontName = res.add(_font).getName();
    aps.setResources(res);
    PDAppearanceContentStream apsContent = null;
    try {
        // draw directly on the XObject's content stream
        apsContent = new PDAppearanceContentStream(aps);
        apsContent.beginText();
        apsContent.setFont(_font, _fontSize);
        apsContent.showText(ann.getContents());
        apsContent.endText();
    } finally {
        if (apsContent != null) {
            try { apsContent.close(); } catch (Exception ex) { log.error(ex.getMessage(), ex); }
        }
    }
    aps.getResources().getCOSObject().setNeedToBeUpdated(true);
    aps.getCOSObject().setNeedToBeUpdated(true);
    return aps;
}
This code runs, but it creates a PDF with dots instead of the actual characters, which, I guess, means that the font subset has not been embedded. Moreover, I get the following warning:
2021-04-17 12:33:31.326 WARN 20820 --- [ main] o.a.p.pdmodel.PDAbstractContentStream : attempting to use subset font LiberationSans without proper context
After looking through the source code, my guess is that I am messing something up when creating the appearance stream: somehow it is not connected with the PDDocument, so the subsetting does not proceed normally. Note that the above code works well when the font is embedded fully (i.e. if I call PDType0Font.load with the last parameter set to false).
Can anyone give me a hint? Thank you!
I don't know - was I just lucky? Very often, luck in programming points to something completely wrong or misleading. In any case, if someone can still give a hint, my ears are more than open...
Again, after looking through the code, I saw the following in PDDocument.save():
// subset designated fonts
for (PDFont font : fontsToSubset)
{
    font.subset();
}
This does not happen in PDDocument.saveIncremental(), which I am using. Just to mess around with the code, I did the following right before calling saveIncremental() on my document:
_font.subset(); // you can see in the beginning of the question how _font is created
_font.getCOSObject().setNeedToBeUpdated(true);
pddoc.saveIncremental(baos);
Believe it or not, the document was saved correctly - at least it appears correct in Acrobat Reader DC and in the Chrome and Firefox PDF viewers. Note that the Unicode code points are added to the font's subset during showText() on the appearance content stream.
UPDATE 18/04/2021: as I mentioned in the comments, I got reports from users who started seeing messages like "Cannot extract the embedded font XXXXXX+LiberationSans-Regular from ..." when they opened the modified PDF files. Strangely enough, I didn't see these messages during my tests. It turns out that my copy of Acrobat Reader DC was newer than theirs; specifically, with the continuous release version 2021.001.20149 no errors were shown, while with the continuous release version 2020.012.20043 the above message was shown.
After investigation, it turns out that the problem was with the way I was embedding the font. I am not aware whether any other way exists, and I am not familiar enough with the PDF specification to know otherwise. What I was doing, as you can see from the code above, was to load the font ONCE for the document, and then to use it freely in the resource dictionary of the appearance stream of EVERY annotation. As a result, all the resource dictionaries of the annotation content streams referenced an F1 font that was defined with the SAME /BaseFont name. The PDF Reference, 3rd ed., on p. 323 specifically states that:
"... the PostScript name of the font - ... - begins with a tag
followed by a plus sign (+). The tag consists of exactly six uppercase
letters; the choice of letters is arbitrary, but different subsets in
the same PDF file must have different tags..."
Once I started to call PDType0Font.load for each of my annotations and to call subset() (and of course setNeedToBeUpdated) after creating the appearance stream for each of them, the BaseFont values indeed started to look different - and indeed, the older 2020 version of Acrobat Reader DC stopped complaining.
[edit 07/10/2021: even using a single PDFont object per page (with multiple annotations sharing this font), and subsetting it once after having called showText on the appearances of all annotations, appears not to work - the subsetting apparently only uses the letters passed to the first showText, not the others, resulting in wrong rendering of the 2nd, 3rd etc. annotations whenever they contain characters that did not occur in the 1st annotation. So I reiterate: what worked was to call load for each separate annotation and then, after modifying its appearance with showText (which marks the letters to be used during subsetting), to call subset() on each of these fonts (which results in the change of the font name).]
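To summarize, a condensed sketch of the per-annotation flow that worked (pddoc and the annotation-building step are from the code above; outputStream stands for wherever the increment is written; exception handling omitted):

// one font instance per annotation, with subsetting enabled
try (InputStream fs = PDFService.class.getResourceAsStream("/static/fonts/LiberationSans-Regular.ttf")) {
    PDType0Font font = PDType0Font.load(pddoc, fs, true);
    // ... build the annotation and its appearance stream with this font;
    // showText() marks the code points that the subset must contain ...
    font.subset();                                // write the subset now
    font.getCOSObject().setNeedToBeUpdated(true); // include it in the increment
}
pddoc.saveIncremental(outputStream);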
Note that, other than using iText RUPS for inspecting the PDF contents, one can use the Foxit PDF viewer to at least verify that the subset font names are different. Acrobat Reader DC and PDF-XChange in Properties -> Fonts just show the initial font name, like LiberationSans, without the 6-letter unique prefix.
UPDATE 19/04/2021 I am still working on this issue, because I still get reports about the infamous "Cannot extract the embedded font" message. It is quite possible that the original cause of that message was not (or not only) the fact that the different subsets had the same BaseFont names. One thing I am observing is that on some computers the stamp annotations I am using cause Acrobat Reader DC to automatically open the so-called "Comments pane"; there are options to turn this automatism off (Preferences -> Commenting -> Show comments pane when a PDF with comments is opened). When this pane opens, either manually or automatically, the error message appears (and I was at my wits' end trying to see why the same version of Acrobat Reader DC behaves differently on different machines). I think that Acrobat Reader tries to extract the full version of the font and fails, since it is only a subset. But, I guess, this does not concern the semantic contents of the document - the document still passes "qpdf --check". I am currently trying to find out whether it is possible to restrict stamps so that they do not allow comments - i.e. some way to disable the comments pane in Acrobat Reader DC - although I have little hope.
UPDATE 20/04/2021: opened a new question here
I am developing a font converter app which will convert Unicode font text to Krutidev/Shree Lipi (Marathi/Hindi) font text. The original docx file contains formatted words (i.e. color, font, size of the text, hyperlinks, etc.).
I want to keep the format of the final docx the same as the original docx after converting the words from Unicode to the other font.
PFA.
Here is my code:
try {
    fileInputStream = new FileInputStream("StartDoc.docx");
    document = new XWPFDocument(fileInputStream);
    XWPFWordExtractor extractor = new XWPFWordExtractor(document);
    List<XWPFParagraph> paragraph = document.getParagraphs();
    Converter data = new Converter();
    for (XWPFParagraph p : document.getParagraphs())
    {
        for (XWPFRun r : p.getRuns())
        {
            String string2 = r.getText(0);
            data.uniToShree(string2);
            r.setText(string2, 0);
        }
    }
    // Write the Document in file system
    FileOutputStream out = new FileOutputStream(new File("Output.docx"));
    document.write(out);
    out.close();
    System.out.println("Output.docx written successfully");
}
catch (IOException e) {
    System.out.println("We had an error while reading the Word Doc");
}
Thank you for asking.
I worked with POI some years ago, though on Excel workbooks; still, I'll try to help you reach the root cause of your error.
The Java runtime gives you good debugging information in the exception itself!
A good first step to disambiguate the error is not to overwrite the exception message that is provided to you.
Try printing the results of e.getLocalizedMessage() or e.getMessage() and see what you get.
Getting the stack trace using the printStackTrace method is also often useful to pinpoint where your error lies!
Share your findings from the above method calls to help us help you debug the issue.
[EDIT 1:]
So it seems you are able to process the file just right with respect to the font conversion of the data, but you are not able to reconstruct the formatting of the original data in the converted file.
(Thus, "We had an error while reading the Word Doc" is a lie getting printed ;) )
Now, there are 2 elements to a Word document:
Content
Structure or Schema
You are able to convert the data because you are working only on the content of your respective doc files.
In order to retain the formatting of the contents, your solution needs to be aware of the formatting of the doc files as well and take care of that.
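As a side note: run-level properties (font, color, size) live on the XWPFRun rather than on the character data, so replacing a run's text in place keeps that formatting. Assuming your uniToShree returns the converted string (your Converter class is not shown, so this is a guess), a minimal sketch would be:

for (XWPFParagraph p : document.getParagraphs()) {
    for (XWPFRun r : p.getRuns()) {
        String text = r.getText(0);
        if (text != null) {
            // replace the run's text; the run keeps its formatting
            r.setText(data.uniToShree(text), 0);
        }
    }
}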
MS Word, which defines the doc files and their extension (.docx), follows a particular set of schemas that define the rules of formatting. These schemas are defined in Microsoft's XML Namespace packages [1].
You can obtain the XML (HTML) form of the doc file you want quite easily (see the steps in [1] or the code in link [2]) and even apply different schemas, or possibly your own schema definitions based on the definitions provided by MS's namespaces. You can do this programmatically, for which you need to get versed in XML, XSL and XSLT concepts (w3schools [3] is a good starting point) - though this method is no less complex than writing your own version of MS Word - or you can use MS Word's built-in tools as shown in [1].
[1]. https://www.microsoftpressstore.com/articles/article.aspx?p=2231769&seqNum=4#:~:text=During%20conversion%2C%20Word%20tags%20the,you%20can%20an%20HTML%20file.
[2]. https://svn.apache.org/repos/asf/poi/trunk/src/scratchpad/testcases/org/apache/poi/hwpf/converter/TestWordToHtmlConverter.java
[3]. https://www.w3schools.com/xml/
My answer provides you with a cursory overview of how to achieve what you want; depending on your inclination and time availability, you may want to use your discretion before deciding which path to take.
Hope it helps!
The Preflight tool (version 2.0.15) validates the generated PDF file (created with PDFBox version 2.0.15) correctly, but an online PDF tool (e.g. https://www.pdf-online.com/osa/validate.aspx) does not validate it. I am getting the error below:
Compliance pdfa-1b
Result Document does not conform to PDF/A.
Details
Validating file "file.pdf" for conformance level pdfa-1b
Anonymous RDF resources (rdf:Description without rdf:about attribute) are not allowed in XMP Metadata.
The appearance dictionary doesn't contain an entry.
The appearance dictionary doesn't contain an entry.
The appearance dictionary doesn't contain an entry.
The appearance dictionary doesn't contain an entry.
The appearance dictionary doesn't contain an entry.
The document does not conform to the requested standard.
The document contains annotations or form fields with ambigous or without appropriate appearances.
The document's meta data is either missing or inconsistent or corrupt.
The document does not conform to the PDF/A-1b standard.
Done.
To generate the metadata I use the code below:
private void addMetadata(PDDocument pdDocument, final String zzz, final String yyy) {
    PDDocumentCatalog catalog = pdDocument.getDocumentCatalog();
    PDDocumentInformation info = pdDocument.getDocumentInformation();
    info.setCreationDate(Calendar.getInstance());
    info.setModificationDate(Calendar.getInstance());
    info.setAuthor(metadataAuthor);
    info.setProducer(metadataProducer);
    info.setTitle(zzz + "_" + yyy);
    info.setKeywords("aaa");
    info.setCreator("aaa");
    info.setSubject("aaa");
    PDMarkInfo markInfo = new PDMarkInfo();
    markInfo.setMarked(true);
    catalog.setMarkInfo(markInfo);
    try {
        PDMetadata metadataStream = new PDMetadata(pdDocument);
        catalog.setMetadata(metadataStream);
        XMPMetadata xmp = new XMPMetadata();
        XMPSchemaPDFAId pdfaid = new XMPSchemaPDFAId(xmp);
        xmp.addSchema(pdfaid);
        pdfaid.setConformance("B");
        pdfaid.setPart(1);
        pdfaid.setAbout("");
        XMPSchemaDublinCore dcSchema = xmp.addDublinCoreSchema();
        dcSchema.setTitle(info.getTitle());
        dcSchema.addCreator("aaa");
        dcSchema.setDescription(info.getSubject());
        XMPSchemaPDF pdfSchema = xmp.addPDFSchema();
        pdfSchema.setKeywords(info.getKeywords());
        pdfSchema.setProducer(info.getProducer());
        XMPSchemaBasic basicSchema = xmp.addBasicSchema();
        basicSchema.setModifyDate(info.getModificationDate());
        basicSchema.setCreateDate(info.getCreationDate());
        basicSchema.setCreatorTool(info.getCreator());
        metadataStream.importXMPMetadata(xmp.asByteArray());
        InputStream colorProfile = getClass().getClassLoader().getResourceAsStream("icm/sRGB Color Space Profile.icm");
        // create output intent
        PDOutputIntent oi = new PDOutputIntent(pdDocument, colorProfile);
        String value = "sRGB IEC61966-2.1";
        oi.setInfo(value);
        oi.setOutputCondition(value);
        oi.setOutputConditionIdentifier(value);
        oi.setRegistryName("http://www.color.org");
        catalog.addOutputIntent(oi);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Any suggestions?
As discussed in the comments:
1) The failure to report "The appearance dictionary doesn't contain an entry" is a bug in PDFBox preflight that will be fixed in 2.0.17, see PDFBOX-4586. According to this document:
An ISO 19005-1 validator shall FAIL otherwise conforming files in
which a widget annotation lacks an appearance dictionary
2) The "rdf:Description without rdf:about attribute" issue may or may not be a bug; VeraPDF doesn't consider it to be one. Your code uses a 1.8.* version. For these versions, you can call dcSchema.setAbout("") to fix this. In 2.0.* the problem doesn't occur if you create the schema with metadata.createAndAddDublinCoreSchema().
I have created an issue in the VeraPDF project and they will bring this question up for discussion at the next meeting of the validation technical working group.
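For reference, the equivalent setup with the 2.0.* xmpbox API looks roughly like this (classes from org.apache.xmpbox; modelled on the PDFBox CreatePDFA example, checked exceptions omitted):

XMPMetadata xmp = XMPMetadata.createXMPMetadata();
DublinCoreSchema dc = xmp.createAndAddDublinCoreSchema(); // rdf:about is handled for you
dc.setTitle(info.getTitle());
PDFAIdentificationSchema pdfaid = xmp.createAndAddPFAIdentificationSchema();
pdfaid.setPart(1);
pdfaid.setConformance("B");
ByteArrayOutputStream baos = new ByteArrayOutputStream();
new XmpSerializer().serialize(xmp, baos, true);
metadataStream.importXMPMetadata(baos.toByteArray());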
3) That the widgets didn't contain an entry is because at the time setValue() was called, not enough information was present (e.g. the rectangle). That is why you got the message "widget of field aa has no rectangle, no appearance stream created".
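In other words, the widget needs its rectangle before the value is set, so that PDFBox can generate the appearance stream. A minimal ordering sketch (the field name "aa" comes from the log message above; acroForm and the coordinates are made up for illustration):

PDTextField field = (PDTextField) acroForm.getField("aa");
PDAnnotationWidget widget = field.getWidgets().get(0);
widget.setRectangle(new PDRectangle(50, 700, 200, 20)); // give the widget a Rect first
field.setValue("some value");                           // now an appearance stream can be built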
I have a PDF with a lot of AcroForm fields. I do some manipulation on it which results in a new PDF.
So I have PDF-1 (the original one) and PDF-2 (a duplicate of PDF-1), and now I want to merge them. Both PDFs have some form fields, for example field_a, field_b...
Before I merge them I flatten PDF-1, because I only want to keep the fields from PDF-2. When I check the new merged PDF, I can see that there are no visible fields on the pages from PDF-1, and there are fields on the pages from PDF-2. At first glance it seems OK, but when I inspect the fields I can see that the merger has renamed all the fields of PDF-2, e.g. field_a_dummy123, field_b_dummy232 ...
It seems to me that flattening does not remove the fields, and that's why the PDFMergerUtility from PDFBox renames the fields of PDF-2, because field names need to be unique. Is there a way to completely remove the form fields of PDF-1?
@Test
public void flattenAndMerge() throws IOException {
    File testForm = new File(classLoader.getResource("./TestForm.pdf").getFile());
    byte[] testFormAsByte = Files.readAllBytes(testForm.toPath());
    byte[] testFormAsByte2 = Files.readAllBytes(testForm.toPath());
    PDDocument pdf1 = PDDocument.load(testFormAsByte);
    PDAcroForm acroform = pdf1.getDocumentCatalog().getAcroForm();
    acroform.flatten();
    Path flattendedPdf = Files.createTempFile("flatten", ".pdf");
    pdf1.save(flattendedPdf.toFile());
    PDFMergerUtility merger = new PDFMergerUtility();
    merger.addSource(new ByteArrayInputStream(Files.readAllBytes(flattendedPdf)));
    merger.addSource(new ByteArrayInputStream(testFormAsByte2));
    merger.setDestinationFileName("./build/flattenAndMerge.pdf");
    merger.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());
}
I am using PDFBox 2.0.8.
This is the input file: https://ufile.io/6etxp
Here is the result of the test: https://ufile.io/bh94n
As far as I can see, the problem only occurs with checkboxes; normal text fields are removed correctly.
As already mentioned in a comment:
Indeed, this is a bug. It is not, though, as the OP assumed, that flattening does not remove the fields; it is a problem of the merging code in PDFMergerUtility.mergeAcroForm.
The underlying problem is in the handling of non-trivial field hierarchies: in the sample source document shared by the OP, the checkbox fields are not top-level fields; they are located under the top-level node "cb_a".
In the merged document they are not only renamed but also added to the list of top-level form fields; this actually is not valid, as they still have a Parent reference to "cb_a".
This bug is currently being discussed and resolved in the context of the Apache Jira entry PDFBOX-4066.
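Until that fix is released, a possible stop-gap (a sketch, assuming PDF-1 needs no interactive fields at all after flattening) is to drop the leftover AcroForm entry from the flattened document before merging, so that the merger has nothing left to rename:

PDDocument pdf1 = PDDocument.load(testFormAsByte);
pdf1.getDocumentCatalog().getAcroForm().flatten();
// stop-gap: remove the stale AcroForm dictionary entirely (COSName from org.apache.pdfbox.cos)
pdf1.getDocumentCatalog().getCOSObject().removeItem(COSName.ACRO_FORM);
pdf1.save(flattendedPdf.toFile());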
I need to create a table of contents page in the PDF. I will create it from the reading bookmarks in the PDF.
With iText I use:
tmp = SimpleBookmark.getBookmark(reader);
Testing with this PDF:
Download file PDF
It returns this map:
[{Action=GoTo, Named=__WKANCHOR_2, Title=Secretariat Teste0}, {Action=GoTo, Named=__WKANCHOR_4, Title=Secretariat TestBook1}, {Action=GoTo, Named=__WKANCHOR_6, Title=Secretariat Test2}, {Action=GoTo, Named=__WKANCHOR_8 ...
Without the page number.
How can I build the table of contents with title and page number?
I would like to show this:
Please read the answer to this question: Java: Reading PDF bookmark names with itext
It explains how you can use the SimpleBookmark class to get the titles of an outline tree (which is what "bookmarks" are called in the PDF specification).
public void inspectPdf(String filename) throws IOException, DocumentException {
    PdfReader reader = new PdfReader(filename);
    List<HashMap<String, Object>> bookmarks = SimpleBookmark.getBookmark(reader);
    for (int i = 0; i < bookmarks.size(); i++) {
        showTitle(bookmarks.get(i));
    }
    reader.close();
}

public void showTitle(HashMap<String, Object> bm) {
    System.out.println((String) bm.get("Title"));
    List<HashMap<String, Object>> kids = (List<HashMap<String, Object>>) bm.get("Kids");
    if (kids != null) {
        for (int i = 0; i < kids.size(); i++) {
            showTitle(kids.get(i));
        }
    }
}
Then read the answer to this question: Set inherit Zoom(action property) to bookmark in the pdf file
You'll see that the HashMap<String, Object> doesn't only contain an entry with key "Title"; it can also contain an entry with key "Page". That is the case when the bookmark points at a page. The value will be an explicit destination: it consists of the page number and a value such as Fit, FitH, FitB or XYZ, followed by some parameters that mark the position.
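For instance, when a "Page" entry is present, the page number can be read off the front of that destination string. A small sketch, assuming the typical "12 XYZ 30.24 781.45 0" form of the value:

String page = (String) bm.get("Page");
if (page != null) {
    // the destination string starts with the page number
    int pageNumber = Integer.parseInt(page.split(" ")[0]);
    System.out.println(bm.get("Title") + " -> p" + pageNumber);
}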
If you look at the CreateOutlineTree example, you'll see that you can also extract the bookmarks as an XML file:
public void createXml(String src, String dest) throws IOException {
    PdfReader reader = new PdfReader(src);
    List<HashMap<String, Object>> list = SimpleBookmark.getBookmark(reader);
    SimpleBookmark.exportToXML(list, new FileOutputStream(dest), "ISO8859-1", true);
    reader.close();
}
This is a screenshot from a book I wrote about iText that shows you which keys you can expect in a bookmark entry:
As you can tell from this table, a link can also be expressed as a named destination. In that case, you won't get the page number, but a name. To get the page number, you need to extract the list of named destinations. This list will get you the explicit destination corresponding with the named destination.
That is also explained in the book, as well as in the official documentation.
Once you have the titles and the page numbers (retrieved with code written based on the pointers above), you can insert pages into the PDF file using PdfStamper and its insertPage() method. You can put the TOC on those pages using ColumnText, or you can create a separate PDF for the TOC and merge it with the original one. See How to add a cover/PDF in a existing iText document to find out more about these two techniques.
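A rough sketch of the first technique (iText 5; toc is a hypothetical map of titles to page numbers, filled using the bookmark code above):

PdfReader reader = new PdfReader(src);
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
stamper.insertPage(1, reader.getPageSizeWithRotation(1)); // new first page for the TOC
ColumnText ct = new ColumnText(stamper.getOverContent(1));
ct.setSimpleColumn(36, 36, 559, 806);
Map<String, Integer> toc = new LinkedHashMap<String, Integer>(); // title -> page number
for (Map.Entry<String, Integer> entry : toc.entrySet()) {
    ct.addElement(new Paragraph(entry.getKey() + "  p" + entry.getValue()));
}
ct.go();
stamper.close();
reader.close();

Note that inserting a page at position 1 shifts all existing page numbers by one, so the TOC entries should account for that.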
You will also benefit from this example: Create Index File(TOC) for merged pdf using itext library in java
As for the dashed line between the title and the page number, that's done using a separator, more specifically a dotted line separator. You should read this question first: iTextSharp - Is it possible to set a different alignment in the same cell for text
Then read this question: How to Generate Table-of-Figures Dot Leaders in a PdfPCell for the Last Line of Wrapped Text (or this question It is possible with itext 5 which at the end of a paragraph justified the remaining space is filled with scripts?)
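In iText 5 the dotted-leader idiom from those answers boils down to adding a DottedLineSeparator chunk between the title and the page number; a minimal sketch (com.itextpdf.text.pdf.draw.DottedLineSeparator):

// title, dotted leader, then the page number in one paragraph
Paragraph p = new Paragraph("Secretariat Teste0");
p.add(new Chunk(new DottedLineSeparator()));
p.add("p1");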
Note that your question is actually off-topic. It's phrased as a "homework" question: it invites people to do your work in your place. Now that you have all the elements you need, you should be able to do the work yourself. If you don't succeed, you should write an on-topic Stack Overflow question, that is, a question in which you show what you've tried and explain the technical problem you experience.
Update:
You shared a document with the following outline tree:
As you can see, the bookmarks are defined using named destinations, such as /__WKANCHOR_2, /__WKANCHOR_4, and so on. As you can tell from the / character, the names are stored as PDF name objects (PDF 1.1), not as PDF string objects (since 1.2). The most recent PDF standards recommend using PDF string objects instead of PDF name objects; you may want to ask the vendor of your PDF generation software to update the software to meet the recommendations of the most recent PDF standards.
Nevertheless, we can easily get the explicit destinations that correspond with those named destinations. They are stored in the /Dests entry of the root dictionary:
When you look at the way the destinations are defined, you see another problem, one that should be reported to wkhtmltopdf. Let's take a look at what the ISO standard tells us about the syntax to be used for destinations:
The concept of page numbers doesn't exist in PDF. Pages are described using page dictionaries, and the page number is derived from the position of the page in the page tree. The first page that is encountered in the page tree is page 1, the second page that is encountered is page 2, and so on.
In your example, the explicit destinations are defined like this: [9/XYZ 30.2400000 524.179999 0], [9/XYZ 30.2400000 231.379999 0], and so on.
This is wrong. The ISO standard says that the first value in the array needs to be an indirect reference. An indirect reference has the format 9 0 R, not 9. I looked at the structure of the document, and I see that wkhtmltopdf uses page number - 1 instead of an indirect reference. If I look at /__WKANCHOR_2, it refers to [0/XYZ 30.240000 781.459999 0], and that 0 is supposed to point to page 1. As Adobe Reader tolerates crappy software, this works in Adobe Reader, but as the file is in violation of ISO 32000, iText doesn't know what to do with those misleading destinations; at least, the convenience class SimpleNamedDestination doesn't know what to do with them.
Fortunately, iText is a very versatile library that allows you to go deep under the hood of a PDF. In this case, we only have to go one level deeper. Instead of SimpleNamedDestination.getNamedDestination(reader, true), we can use the following approach:
HashMap<String, PdfObject> names = reader.getNamedDestinationFromNames();
for (Map.Entry<String, PdfObject> entry : names.entrySet()) {
    System.out.print(entry.getKey());
    System.out.print(": p");
    PdfArray arr = (PdfArray) entry.getValue();
    System.out.println(arr.getAsNumber(0).intValue() + 1);
}
reader.close();
The output of this method is:
__WKANCHOR_w: p7
__WKANCHOR_y: p7
__WKANCHOR_2: p1
__WKANCHOR_4: p1
__WKANCHOR_16: p9
__WKANCHOR_14: p8
__WKANCHOR_18: p9
__WKANCHOR_1s: p13
__WKANCHOR_a: p2
__WKANCHOR_1q: p13
__WKANCHOR_1o: p12
__WKANCHOR_12: p8
__WKANCHOR_1m: p12
__WKANCHOR_e: p3
__WKANCHOR_10: p7
__WKANCHOR_1k: p12
__WKANCHOR_c: p3
__WKANCHOR_1i: p11
__WKANCHOR_i: p4
__WKANCHOR_8: p2
__WKANCHOR_g: p3
__WKANCHOR_1g: p11
__WKANCHOR_6: p1
__WKANCHOR_1e: p10
__WKANCHOR_m: p5
__WKANCHOR_1c: p10
__WKANCHOR_k: p4
__WKANCHOR_q: p5
__WKANCHOR_1a: p9
__WKANCHOR_o: p5
__WKANCHOR_u: p6
__WKANCHOR_s: p6
If we check __WKANCHOR_2, we see that it correctly points at page 1. I checked the final link in the outlines: it points at the named destination with name __WKANCHOR_1s, and indeed, that should link to page 13.
Your problem is a clear example of a "garbage in, garbage out" problem. Your tool produces PDFs that violate the ISO standard for PDF, and as a result you lose plenty of time trying to figure out what's wrong. But what's even worse: you made me lose time because of someone else's fault.