Accessing "alternate text" for an image via PDFBox

Is there some way to extract "alternate text" for a specific image using PDFBox?
I have a PDF file which, as described at http://www.w3.org/WAI/GL/2011/WD-WCAG20-TECHS-20110621/pdf.html#PDF1, has had alternate text added to an image. Using PDFBox I can find my way through the object model to the image itself (a PDXObjectImage) via PDDocument.getDocumentCatalog().getAllPages() [iterator] .getResources().getImages(), but I cannot see any way to get from the image itself to its alternate text.
A small sample PDF (with a single image which has some alternate text specified) can be found at http://dl.dropbox.com/u/12253279/image_test_pass.pdf (It should say "This is the alternate text for the image.").

I do not know whether or how this can be done with PDFBox, but this feature belongs to the sections of the PDF specification called Logical Structure and Tagged PDF, which not every PDF tool fully supports.
Assuming the tool you are using supports it and gives you access to the internal structure of the PDF file, you will have to follow four main steps to retrieve this information (I will use the sample PDF file you posted for the following explanation):
1- Parse the page content and find the MCID number of the Tag element that wraps the image you are interested in.
Page content:
BT
/P <</MCID 0 >>BDC
/GS0 gs
/TT0 1 Tf
0.0004 Tc -0.0028 Tw 10.02 0 0 10.02 90 711 Tm
(This is an image test )Tj
EMC
ET
/Figure <</MCID 1 >>BDC
q
106.5 0 0 106.5 90 591.0599976 cm
/Im0 Do
Q
EMC
Your image is painted by /Im0 Do inside the /Figure marked-content sequence, so its MCID is 1.
2- In the page object, retrieve the key StructParents.
3- Now retrieve the Structure Tree (key StructTreeRoot of the Catalog object, which is the root object in every PDF file), and inside it, the ParentTree.
4- The ParentTree starts with an array where you can find pairs of elements (see Number Trees in the PDF Spec for more details). In this specific tree, the first element of each pair is a numeric value that corresponds to the StructParents key retrieved in step 2, and the second element is an array of objects whose indexes correspond to the MCID values retrieved in step 1. So you search this array for the element at the index given by the MCID value of your image, and you will find a PDF object. Inside this object, you will find the alternate text.
Looks easy, doesn't it?
Tools used in this answer:
PDF Vole (based on iText)
Amyuni PDF Analyzer
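Steps 1 and 4 above can be sketched in plain Java, without any PDF library. This is only a model: it scans the decoded page content with a regular expression (it ignores edge cases such as operators appearing inside text strings), and it represents the ParentTree as an ordinary Map. The class name, the method names, and the StructParents value of 0 are assumptions made for illustration; only the page content and the alternate text come from the sample file.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class McidLookupSketch {

    // Page content of the sample file, as listed in the answer.
    static final String SAMPLE_CONTENT =
            "BT\n/P <</MCID 0 >>BDC\n/GS0 gs\n/TT0 1 Tf\n"
            + "0.0004 Tc -0.0028 Tw 10.02 0 0 10.02 90 711 Tm\n"
            + "(This is an image test )Tj\nEMC\nET\n"
            + "/Figure <</MCID 1 >>BDC\nq\n106.5 0 0 106.5 90 591.0599976 cm\n"
            + "/Im0 Do\nQ\nEMC\n";

    // group 1: MCID value, group 2: BDC/BMC, group 3: EMC, group 4: XObject name before Do
    private static final Pattern TOKEN = Pattern.compile(
            "(?:<<\\s*/MCID\\s+(\\d+)\\s*>>\\s*)?\\b(BDC|BMC)\\b|\\b(EMC)\\b|/(\\w+)\\s+Do\\b");

    /** Step 1: MCID of the innermost marked-content sequence that paints the XObject, or -1. */
    static int findMcid(String content, String xObjectName) {
        Deque<Integer> open = new ArrayDeque<>(); // -1 marks a sequence without an inline MCID
        Matcher m = TOKEN.matcher(content);
        while (m.find()) {
            if (m.group(2) != null) {
                open.push(m.group(1) != null ? Integer.parseInt(m.group(1)) : -1);
            } else if (m.group(3) != null) {
                if (!open.isEmpty()) {
                    open.pop();
                }
            } else if (xObjectName.equals(m.group(4))) {
                for (int mcid : open) { // iterates from the innermost sequence outwards
                    if (mcid >= 0) {
                        return mcid;
                    }
                }
            }
        }
        return -1;
    }

    /** Step 4: parentTree maps StructParents -> array indexed by MCID -> structure element entries. */
    static String altText(Map<Integer, List<Map<String, String>>> parentTree,
                          int structParents, int mcid) {
        List<Map<String, String>> elements = parentTree.get(structParents);
        if (elements == null || mcid < 0 || mcid >= elements.size()) {
            return null;
        }
        return elements.get(mcid).get("Alt");
    }

    public static void main(String[] args) {
        // Hypothetical model of the sample's ParentTree; the key 0 is an assumed StructParents value.
        Map<Integer, List<Map<String, String>>> parentTree = Map.of(
                0, List.of(
                        Map.of("S", "P"),
                        Map.of("S", "Figure", "Alt", "This is the alternate text for the image.")));
        int mcid = findMcid(SAMPLE_CONTENT, "Im0");
        System.out.println("MCID = " + mcid);
        System.out.println("Alt  = " + altText(parentTree, 0, mcid));
    }
}
```

With a real library you would read the decoded content stream, the StructParents entry, and the ParentTree from the file instead of hard-coding them; the lookup logic stays the same.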

Eric from the PDFBox mailing list sent me the following, though I've not tested it out yet...
Hi,
For your test file, here is a way to access the "/Alt" entry:
    PDDocument document = PDDocument.load("image_test_pass.pdf");
    PDStructureTreeRoot treeRoot =
            document.getDocumentCatalog().getStructureTreeRoot();
    // get page for each StructElement
    for (Object o : treeRoot.getKids()) {
        if (o instanceof PDStructureElement) {
            PDStructureElement structElement = (PDStructureElement) o;
            System.out.println(structElement.getAlternateDescription());
            PDPage page = structElement.getPage();
            if (page != null) {
                page.getResources().getImages();
            }
        }
    }
Please refer to the PDF specification http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf, in particular §14.6, §14.7, §14.9.3 and §14.9.4, for all the rules on how to find the "/Alt" entry. There seem to be several ways to define this information.
BR,
Eric

Related

Apache PDFBox 2.0.18 - Comments/Annotations status linking

I need to merge comments taken from many versions of the same pdf file but with different comments, into one PDF file containing all comments.
I take all the comments from the pages and create an arrayList of them, then I simply set this array of comments on the new pdf file and it works pretty well.
The problem is that I also need to create an Excel file listing all the comments found, together with their "status" (accepted, cancelled, rejected, etc.).
The status seems to be managed as a separate annotation/comment from PDFBox and I can't find any relation between a comment and its status.
Example:
I have a PDAnnotation object with content "COMMENT 1".
And I have another PDAnnotation object with content "Accepted by user XX" (the status of COMMENT 1).
In Acrobat Reader I see the comment "COMMENT 1" with the status set to "Accepted", so there must be a relation between the two objects, but I can't find it.
Any ideas?

Using the PDFDebugger is a good suggestion; it should give you an overview of how objects (including PDAnnotations) are linked to each other.
In any case, check whether the COSDictionary of your child PDAnnotation has an IRT key. That key should contain the parent COSObject as its value.
So if you do something like this:
    COSDictionary parentDict = (COSDictionary) childDict.getDictionaryObject("IRT");
you should get the parent PDAnnotation dictionary, and you can take all the data you need from it.
Please note that the cast is necessary because getDictionaryObject returns a COSBase, while the object returned for the IRT key is actually a COSDictionary.
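To show how the IRT link can be turned into one status per comment (for example, one row per comment in the Excel export), here is a minimal plain-Java model. It does not use PDFBox: Annot is a hypothetical stand-in for the annotation dictionary, holding only the Contents, IRT and State entries. It assumes the annotations are listed in the order in which they were added, that a status reply may itself be a reply to an earlier status reply, and that the IRT chain has no cycles.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class AnnotationStateSketch {

    /** Hypothetical stand-in for the annotation entries that matter here. */
    static final class Annot {
        final String contents;  // the Contents entry
        final Annot inReplyTo;  // the IRT entry, or null for a top-level comment
        final String state;     // the State entry, or null for an ordinary comment

        Annot(String contents, Annot inReplyTo, String state) {
            this.contents = contents;
            this.inReplyTo = inReplyTo;
            this.state = state;
        }
    }

    /** Maps each top-level comment to the most recent state found among its replies. */
    static Map<Annot, String> latestStates(List<Annot> annotsInOrder) {
        Map<Annot, String> result = new LinkedHashMap<>();
        for (Annot a : annotsInOrder) {
            if (a.state == null || a.inReplyTo == null) {
                continue; // not a status reply
            }
            Annot root = a.inReplyTo;
            while (root.inReplyTo != null) {
                root = root.inReplyTo; // follow IRT up to the original comment
            }
            result.put(root, a.state); // later status replies overwrite earlier ones
        }
        return result;
    }

    public static void main(String[] args) {
        Annot comment = new Annot("COMMENT 1", null, null);
        Annot status = new Annot("Accepted by user XX", comment, "Accepted");
        for (Map.Entry<Annot, String> e : latestStates(List.of(comment, status)).entrySet()) {
            System.out.println(e.getKey().contents + " -> " + e.getValue());
        }
    }
}
```

With PDFBox, the same three values would be read from each annotation's COSDictionary; comments that never received a status simply do not appear in the returned map.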

Disable pdf-text searching with pdfBox

I have a pdf document (no form) where I want to disable the text searching using pdfBox (java).
I can imagine the following possibilities:
Flatten the text
Remove the text information (without removing the visible text itself)
Add an overlay to the document
Currently I have no idea how to implement any of these. Does anyone have an idea how to solve this?

Many thanks for your help here. I think I found a way that fits the requirements (honestly, not a really clean one):
Add a rectangle over the address sections.
Convert the PDF to an image.
Convert the image back to a PDF.
All text information is lost, and the user can no longer see the critical information. Since this is only for display (the initial PDF document is not changed), this is OK for now.

It depends on your goals:
Remove certain text completely: print the document, black the text out with ink, and scan it again.
Delete sensitive text: scan the text content and remove or replace it (with PDFBox). This is risky, because some text is split across several fragments.
Mask some text in the viewer: find the text and draw a black rectangle over it (with PDFBox). This is not very safe: the rectangle can be removed, or another tool can be used to read the text. Usually, if the text is still in the file, some tool can find it.
Prevent copy/paste of the text (but not search or viewing): use the security options, with a password.
See: https://pdfbox.apache.org/2.0/cookbook/encryption.html
    PDDocument doc = PDDocument.load(new File("filename.pdf"));
    // Define the length of the encryption key.
    // Possible values are 40, 128 or 256.
    int keyLength = 128;
    // 256 => crashes
    AccessPermission ap = new AccessPermission();
    // Disable printing and content extraction; everything else stays allowed.
    ap.setCanPrint(false);
    ap.setCanExtractContent(false);
    ap.setCanExtractForAccessibility(false);
    // The owner password (opens the file with all permissions) is "12345".
    // The user password (opens the file with restricted permissions) is empty here.
    StandardProtectionPolicy spp = new StandardProtectionPolicy("12345", "", ap);
    spp.setEncryptionKeyLength(keyLength);
    spp.setPermissions(ap);
    doc.protect(spp);
    doc.save("filename-encrypted2.pdf");
    doc.close();

How create a table of contents page in the PDF file from the bookmarks with iText?

I need to create a table-of-contents page in the PDF. I want to build it by reading the bookmarks in the PDF.
With iText I use:
tmp = SimpleBookmark.getBookmark (reader);
Testing with this PDF :
Download file PDF
It returns this list of maps:
[{Action = GoTo, Named = __ WKANCHOR_2, Title = Secretariat Teste0}, {Action = GoTo, Named = __ WKANCHOR_4, Title = Secretariat TestBook1}, {Action = GoTo, Named = __ WKANCHOR_6, Title = Secretariat Test2}, {Action = GoTo , Named = __ WKANCHOR_8 ...
The page number is missing.
How can I show a table of contents with title and page number?
I would like to show this:

Please read the answer to this question: Java: Reading PDF bookmark names with itext
It explains how you can use the SimpleBookmark method to get the titles of an outline tree (this is what "bookmarks" are called in the PDF specification).
    public void inspectPdf(String filename) throws IOException, DocumentException {
        PdfReader reader = new PdfReader(filename);
        List<HashMap<String,Object>> bookmarks = SimpleBookmark.getBookmark(reader);
        for (int i = 0; i < bookmarks.size(); i++) {
            showTitle(bookmarks.get(i));
        }
        reader.close();
    }

    public void showTitle(HashMap<String, Object> bm) {
        System.out.println((String) bm.get("Title"));
        List<HashMap<String,Object>> kids = (List<HashMap<String,Object>>) bm.get("Kids");
        if (kids != null) {
            for (int i = 0; i < kids.size(); i++) {
                showTitle(kids.get(i));
            }
        }
    }
Then read the answer to this question: Set inherit Zoom(action property) to bookmark in the pdf file
You'll see that the HashMap<String, Object> doesn't only contain an entry with key "Title", but that it can also contain an entry with key "Page". That is the case when the bookmark points at a page. The value will be an explicit destination. It will consist of the page number, a value such as Fit, FitH, FitB, XYZ, followed by some parameters that mark the position.
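As a small illustration of reading the page number out of such a "Page" entry, here is a sketch in plain Java. It assumes the value is a space-separated string such as "1 XYZ 30.24 781.46 0" (page number first, then the destination type and its parameters); the class and method names are made up for this example and are not part of iText.

```java
import java.util.HashMap;
import java.util.Map;

final class BookmarkPageSketch {

    /** Returns the page number at the start of the "Page" entry, or -1 if there is none. */
    static int pageNumber(Map<String, Object> bookmark) {
        Object page = bookmark.get("Page");
        if (!(page instanceof String)) {
            return -1; // e.g. a bookmark that uses a named destination instead
        }
        String[] parts = ((String) page).trim().split("\\s+");
        try {
            return Integer.parseInt(parts[0]);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    /** Returns the destination type (Fit, FitH, XYZ, ...) or null if it is absent. */
    static String destinationType(Map<String, Object> bookmark) {
        Object page = bookmark.get("Page");
        if (!(page instanceof String)) {
            return null;
        }
        String[] parts = ((String) page).trim().split("\\s+");
        return parts.length > 1 ? parts[1] : null;
    }

    public static void main(String[] args) {
        Map<String, Object> bookmark = new HashMap<>();
        bookmark.put("Title", "Intro");                 // hypothetical bookmark
        bookmark.put("Page", "1 XYZ 30.24 781.46 0");   // assumed value format
        System.out.println(bookmark.get("Title") + " is on page " + pageNumber(bookmark)
                + " (" + destinationType(bookmark) + ")");
    }
}
```

Bookmarks without a "Page" entry return -1 here, which is the signal that the page has to be resolved some other way.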
If you look at the CreateOutlineTree example, you'll see that you can also extract the bookmarks as an XML file:
    public void createXml(String src, String dest) throws IOException {
        PdfReader reader = new PdfReader(src);
        List<HashMap<String, Object>> list = SimpleBookmark.getBookmark(reader);
        SimpleBookmark.exportToXML(list,
                new FileOutputStream(dest), "ISO8859-1", true);
        reader.close();
    }
This is a screenshot from a book I wrote about iText that shows you which keys you can expect in a bookmark entry:
As you can tell from this table, a link can also be expressed as a named destination. In that case, you won't get the page number, but a name. To get the page number, you need to extract the list of named destinations. This list will get you the explicit destination corresponding with the named destination.
That is also explained in the book, as well as in the official documentation.
Once you have the titles and the page numbers (retrieved with code written based on the above pointers), you can insert pages to the PDF file using PdfStamper and the insertPage() method. You can put the TOC on those pages using ColumnText, or you can create a separate PDF for the TOC and merge it with the original one. See How to add a cover/PDF in a existing iText document to find out more about these two techniques.
You will also benefit from this example: Create Index File(TOC) for merged pdf using itext library in java
As for the dashed line between the title and the page number, that's done using a separator, more specifically a dotted line separator. You should read this question first: iTextSharp - Is it possible to set a different alignment in the same cell for text
Then read this question: How to Generate Table-of-Figures Dot Leaders in a PdfPCell for the Last Line of Wrapped Text (or this question It is possible with itext 5 which at the end of a paragraph justified the remaining space is filled with scripts?)
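To make the idea of a dot leader concrete, here is a plain-string sketch. It is not the iText separator approach described in the linked questions: it simply pads a TOC line with dots to a fixed character width, which only lines up in a monospaced font. The class name, method name and width parameter are hypothetical.

```java
final class DotLeaderSketch {

    /** Builds "title ..... page", padded with dots to the given width (at least three dots). */
    static String tocLine(String title, int page, int width) {
        String number = Integer.toString(page);
        // Two characters are reserved for the spaces around the dots.
        int dots = Math.max(3, width - title.length() - number.length() - 2);
        return title + " " + ".".repeat(dots) + " " + number;
    }

    public static void main(String[] args) {
        System.out.println(tocLine("1- Intro", 1, 40));
        System.out.println(tocLine("2- Heading", 2, 40));
    }
}
```

In a real PDF the leader has to fill the remaining width of the line in the chosen font, which is what the separator classes discussed in the linked questions take care of.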
Note that your question is actually off-topic. It's phrased as a "home work" question. It invites people to do your work in your place. Now that you have all the elements you need, you should be able to do the work yourself. If you don't succeed, you should write an on topic Stack Overflow question. That's a question in which you show what you've tried and explain the technical problem you experience.
Update:
You shared a document with the following outline tree:
As you can see, the bookmarks are defined using named destinations, such as /__WKANCHOR_2, /__WKANCHOR_4, and so on. As you can tell from the / character, the names are stored as PDF name objects (PDF 1.1), not as PDF string objects (since 1.2). The most recent PDF standards recommend using PDF string objects instead of PDF name objects, so you may want to ask the vendor of your PDF generation software to update the software to meet those recommendations.
Nevertheless, we can easily get the explicit destinations that correspond with those named destinations. They are stored in the /Dests entry of the root dictionary:
When you look at the way the destinations are defined, you see another problem that should be reported to wkhtmltopdf. Let's take a look at what the ISO standard tells us about the syntax to be used for destinations:
The concept of page numbers doesn't exist in PDF. Pages are described using page dictionaries, and the page number is derived from the position of the page in the page tree. The first page that is encountered in the page tree is page 1, the second page that is encountered is page 2, and so on.
In your example, the explicit destinations are defined like this: [9/XYZ 30.2400000 524.179999 0], [9/XYZ 30.2400000 231.379999 0], and so on.
This is wrong. The ISO standard says that the first value in the array needs to be an indirect reference. An indirect reference has the format 9 0 R, not 9. I looked at the structure of the document, and I see that wkhtmltopdf uses the page number minus 1 instead of an indirect reference. If I look at /__WKANCHOR_2, it refers to [0/XYZ 30.240000 781.459999 0], and that 0 is supposed to point to page 1. As Adobe Reader tolerates crappy software, this works in Adobe Reader, but as the file is in violation of ISO-32000, iText doesn't know what to do with those misleading destinations; at least, the convenience class SimpleNamedDestination doesn't know what to do with them.
Fortunately, iText is a very versatile library that allows you to go deep under the hood of a PDF. In this case, we only have to go one level deeper. Instead of SimpleNamedDestination.getNamedDestination(reader, true), we can use the following approach:
    HashMap<String, PdfObject> names = reader.getNamedDestinationFromNames();
    for (Map.Entry<String, PdfObject> entry : names.entrySet()) {
        System.out.print(entry.getKey());
        System.out.print(": p");
        PdfArray arr = (PdfArray) entry.getValue();
        System.out.println(arr.getAsNumber(0).intValue() + 1);
    }
    reader.close();
The output of this method is:
__WKANCHOR_w: p7
__WKANCHOR_y: p7
__WKANCHOR_2: p1
__WKANCHOR_4: p1
__WKANCHOR_16: p9
__WKANCHOR_14: p8
__WKANCHOR_18: p9
__WKANCHOR_1s: p13
__WKANCHOR_a: p2
__WKANCHOR_1q: p13
__WKANCHOR_1o: p12
__WKANCHOR_12: p8
__WKANCHOR_1m: p12
__WKANCHOR_e: p3
__WKANCHOR_10: p7
__WKANCHOR_1k: p12
__WKANCHOR_c: p3
__WKANCHOR_1i: p11
__WKANCHOR_i: p4
__WKANCHOR_8: p2
__WKANCHOR_g: p3
__WKANCHOR_1g: p11
__WKANCHOR_6: p1
__WKANCHOR_1e: p10
__WKANCHOR_m: p5
__WKANCHOR_1c: p10
__WKANCHOR_k: p4
__WKANCHOR_q: p5
__WKANCHOR_1a: p9
__WKANCHOR_o: p5
__WKANCHOR_u: p6
__WKANCHOR_s: p6
If we check __WKANCHOR_2, we see that it correctly points at page 1. I checked the final link in the outlines, it points at the named destination with name __WKANCHOR_1s and indeed: that should link to page 13.
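Once you have both the bookmark list (with its "Title", "Named" and "Kids" entries) and a name-to-page map like the output above, combining them into table-of-contents entries is plain collection work. The following sketch uses only the JDK and hard-coded sample data taken from the question and the output above; the TocEntry class and the method names are hypothetical, and a page of -1 marks a bookmark whose name could not be resolved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class NamedDestinationTocSketch {

    /** One line of the future table of contents. */
    static final class TocEntry {
        final String title;
        final int page;   // -1 if the named destination is unknown
        final int depth;  // nesting level in the outline tree

        TocEntry(String title, int page, int depth) {
            this.title = title;
            this.page = page;
            this.depth = depth;
        }
    }

    static List<TocEntry> resolve(List<? extends Map<String, Object>> bookmarks,
                                  Map<String, Integer> pageByName) {
        List<TocEntry> out = new ArrayList<>();
        collect(bookmarks, pageByName, 0, out);
        return out;
    }

    @SuppressWarnings("unchecked")
    private static void collect(List<? extends Map<String, Object>> bookmarks,
                                Map<String, Integer> pageByName, int depth, List<TocEntry> out) {
        for (Map<String, Object> bm : bookmarks) {
            Object named = bm.get("Named");
            Integer page = named == null ? null : pageByName.get(named.toString());
            out.add(new TocEntry(String.valueOf(bm.get("Title")), page == null ? -1 : page, depth));
            Object kids = bm.get("Kids");
            if (kids instanceof List) {
                collect((List<Map<String, Object>>) kids, pageByName, depth + 1, out);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> pageByName = new HashMap<>();
        pageByName.put("__WKANCHOR_2", 1);
        pageByName.put("__WKANCHOR_4", 1);
        pageByName.put("__WKANCHOR_6", 1);

        List<Map<String, Object>> bookmarks = List.of(
                Map.<String, Object>of("Action", "GoTo", "Named", "__WKANCHOR_2", "Title", "Secretariat Teste0"),
                Map.<String, Object>of("Action", "GoTo", "Named", "__WKANCHOR_4", "Title", "Secretariat TestBook1"),
                Map.<String, Object>of("Action", "GoTo", "Named", "__WKANCHOR_6", "Title", "Secretariat Test2"));

        for (TocEntry e : resolve(bookmarks, pageByName)) {
            System.out.println("  ".repeat(e.depth) + e.title + " -> page " + e.page);
        }
    }
}
```

In real code, the pageByName map would be filled from the named destinations of the PdfReader rather than by hand.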
Your problem is a clear example of a "garbage in, garbage out" problem. Your tool produces PDFs that are in violation of the ISO standard for PDF, and as a result you lose plenty of time trying to figure out what's wrong. But what's even worse: you made me lose time because of someone else's fault.

Java PDFBox Multi-Page Document With Headings and Table of Contents

The PDFBox v2.0 is still growing and doesn't have any nice and easy examples to get started with.
I need to create a multi-page PDF dynamically from an object; with TableOfContents and Headings!
How do I create Numbered Headings? (Increasing the font size is not an option because the TableOfContents has to know their location[page number] in the document)
Example:
Table of Contents
Document Title Here
1- Intro......................................................1
2- Heading....................................................2
2.1- SubHeading1 ........................................2
2.2- SubHeading2 ........................................5
Page 1
Document Title Here
1- Intro
This is an intro to the document....
2- Heading
2.1- Subheading
Some text here...
........
Page 2
I have two problems:
1- I followed this example here: PDFBox - how to create table of contents but it didn't create any TableOfContents.
I'm getting this exception:
java.lang.IllegalArgumentException: Destination of a GoTo action must be a page dictionary object
at org.apache.pdfbox.pdmodel.interactive.action.PDActionGoTo.setDestination(PDActionGoTo.java:90)
I removed the PDActionGoTo as the answer comment said; it no longer threw an exception, but it still didn't create any TableOfContents.
2- I have no idea how to make headings!

IO Issue - Byte Array Image into XHTML(FlyingSaucer)

I have a solution that inserts strings into an XHTML document and prints the results as Reports. My employer has asked if we could pull images off their SQL database (stored as byte arrays) to insert into the Reports.
I am using FlyingSaucer as the XHTML interpreter and I've been using Java DOM to modify pre-stored reports that I have stored in the Report Generator's package.
The only solution I can think of at the moment is to construct the images, save them as a file, link the file in an img tag (or background-image) in a constructed report, print the report and then delete the file. This seems really sloppy and I imagine it will be very time consuming.
I can't help but feel there must be a more elegant solution. Any suggestions for inserting a byte array into html?

Read the image and convert it into its Base64-encoded form:
    InputStream image = getClass().getClassLoader().getResourceAsStream("image.png");
    String encodedImage = BaseEncoding.base64().encode(ByteStreams.toByteArray(image));
I've used BaseEncoding and ByteStreams from Google Guava.
Change the src attribute of the img element within your Document object:
    Document doc = ...; // get Document from XHTMLPanel.getDocument() or create
                        // a new one using DocumentBuilderFactory
    doc.getElementById("myImage").getAttributes().getNamedItem("src").setNodeValue("data:image/png;base64," + encodedImage);
Unfortunately, FlyingSaucer does not support data URIs out of the box, so you'll have to create your own ReplacedElementFactory. Read the article "Using Data URLs for embedding images in Flying Saucer generated PDFs"; it contains a complete solution.
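Since the images in the question already arrive as byte arrays, the encoding step can also be done with the JDK's own java.util.Base64 instead of Guava. The sketch below is a JDK-only variant under a few assumptions: the class and method names are made up, the three bytes stand in for real image data, and the img element is found by comparing its id attribute rather than with getElementById, because getElementById only works when the attribute is declared as an ID (for example by a DTD). It covers only the DOM side, not the FlyingSaucer ReplacedElementFactory.

```java
import java.io.StringReader;
import java.util.Base64;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DataUriSketch {

    /** Builds a data URI such as data:image/png;base64,... from raw image bytes. */
    static String toDataUri(byte[] imageBytes, String mimeType) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(imageBytes);
    }

    /** Sets src on every img element whose id attribute matches; returns how many were changed. */
    static int embedImage(Document doc, String imgId, byte[] imageBytes, String mimeType) {
        int changed = 0;
        NodeList imgs = doc.getElementsByTagName("img");
        for (int i = 0; i < imgs.getLength(); i++) {
            Element img = (Element) imgs.item(i);
            if (imgId.equals(img.getAttribute("id"))) {
                img.setAttribute("src", toDataUri(imageBytes, mimeType));
                changed++;
            }
        }
        return changed;
    }

    /** Parses a small XHTML string; wraps checked exceptions to keep the example short. */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        byte[] fromDatabase = {1, 2, 3}; // placeholder for the bytes read from the database
        Document doc = parse("<html><body><img id=\"myImage\" src=\"\"/></body></html>");
        embedImage(doc, "myImage", fromDatabase, "image/png");
        Element img = (Element) doc.getElementsByTagName("img").item(0);
        System.out.println(img.getAttribute("src"));
    }
}
```

This avoids writing temporary image files; the image travels inside the document itself.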
