MigraDOC and pdfSharp, center dynamic text

I am looking for a way to create a PDF document in which I can calculate the placement of an input field.
The PDF content should look something like this (all PDF content needs to be centered):
Header
Field1: not dynamic (needs to be centered)
Field2: UserName (dynamic; needs to be appended in the center of the paragraph, since each user has a different name length)
So my question is: does PDFsharp or MigraDoc have a method that can align text to the center (meaning that it does the calculations, taking the font family and font size into account, so that in the end the marked text is centered)?
If so, what is the method? I've searched the MigraDoc and PDFsharp documentation and could not find anything like that.
And if no such method exists, has anyone tried this or worked with it? Any suggestions on how I can achieve this behavior, or some source to look at?
Thank you

Sample 1 shows, with many code examples, how to create a PDF, and it uses almost all of the functionality that MigraDoc offers. Sample 2 shows in detail how to create tables, which might also be interesting for you with respect to laying out the page's content.
The alignment (center/left/right) is usually done by setting the Format.Alignment property like this:
Paragraph par = new Paragraph();
par.Format.Alignment = ParagraphAlignment.Center;
A short version of a document with centered content would be:
// first you need a document
Document MigraDokument = new Document();
// each document needs at least one section
Section section = MigraDokument.AddSection();
section.PageSetup.PageFormat = PageFormat.A4;
// and then you add paragraphs to the section
Paragraph par = section.AddParagraph();
// and set the alignment as you wish
par.Format.Alignment = ParagraphAlignment.Center;
// now just fill it with content and set the rest of the parameters...
par.AddText("text");

Related

Disable pdf-text searching with pdfBox

I have a PDF document (no form) in which I want to disable text searching using PDFBox (Java).
I can imagine the following possibilities:
Flatten text
Remove Text information (without removing text itself)
Add overlay to document.
Currently I have no idea how to implement that. Does anyone have an idea how to solve it?

Many thanks for your help here. I think I found a way that fits the requirements (honestly, it is not really clean):
Add the rectangle to the address sections
Convert the PDF to an image
Convert the image back to a PDF.
Although all text information is lost, the user can no longer see the critical information. Because this is only for display (the initial PDF document doesn't get changed), this is OK for now.

It depends on your goals:
avoid everything on some texts: print, mark with black ink, and scan again;
delete sensitive text: you have to scan the text and remove/replace it (with PDFBox), but it is risky (some text is split into several pieces);
mask some text in the viewer: find the text and add a black rectangle (with PDFBox), but it is not very safe. Someone can remove the rectangle or use another tool to read the text. Usually, if the text is still inside the file, some tool can find it;
prevent copy/paste of the text (but not search/view): use the security options, with a password:
see: https://pdfbox.apache.org/2.0/cookbook/encryption.html
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.encryption.AccessPermission;
import org.apache.pdfbox.pdmodel.encryption.StandardProtectionPolicy;

PDDocument doc = PDDocument.load(new File("filename.pdf"));
// Define the length of the encryption key.
// Possible values are 40, 128 or 256.
int keyLength = 128;
// Note from the author: 256 crashed when they tried it.
AccessPermission ap = new AccessPermission();
// Disable printing and content extraction; everything else stays allowed.
ap.setCanPrint(false);
ap.setCanExtractContent(false);
ap.setCanExtractForAccessibility(false);
// Owner password (opens the file with all permissions) is "12345".
// User password (opens the file with restricted permissions) is empty here.
StandardProtectionPolicy spp = new StandardProtectionPolicy("12345", "", ap);
spp.setEncryptionKeyLength(keyLength);
spp.setPermissions(ap);
doc.protect(spp);
doc.save("filename-encrypted2.pdf");
doc.close();

How create a table of contents page in the PDF file from the bookmarks with iText?

I need to add a table of contents page to a PDF. I want to build it by reading the bookmarks in the PDF.
With iText I use:
tmp = SimpleBookmark.getBookmark (reader);
Testing with this PDF :
Download file PDF
Returns this MAP:
[{Action = GoTo, Named = __ WKANCHOR_2, Title = Secretariat Teste0}, {Action = GoTo, Named = __ WKANCHOR_4, Title = Secretariat TestBook1}, {Action = GoTo, Named = __ WKANCHOR_6, Title = Secretariat Test2}, {Action = GoTo , Named = __ WKANCHOR_8 ...
Without the page number.
How can I show a table of contents with the title and page number of each entry?
I would like to show this:

Please read the answer to this question: Java: Reading PDF bookmark names with itext
It explains how you can use the SimpleBookmark method to get the titles of an outline tree (this is how "bookmarks" are called in the PDF specification).
public void inspectPdf(String filename) throws IOException, DocumentException {
    PdfReader reader = new PdfReader(filename);
    List<HashMap<String,Object>> bookmarks = SimpleBookmark.getBookmark(reader);
    for (int i = 0; i < bookmarks.size(); i++){
        showTitle(bookmarks.get(i));
    }
    reader.close();
}
public void showTitle(HashMap<String, Object> bm) {
    System.out.println((String)bm.get("Title"));
    List<HashMap<String,Object>> kids = (List<HashMap<String,Object>>)bm.get("Kids");
    if (kids != null) {
        for (int i = 0; i < kids.size(); i++) {
            showTitle(kids.get(i));
        }
    }
}
Then read the answer to this question: Set inherit Zoom(action property) to bookmark in the pdf file
You'll see that the HashMap<String, Object> doesn't only contain an entry with key "Title", but that it can also contain an entry with key "Page". That is the case when the bookmark points at a page. The value will be an explicit destination. It will consist of the page number, a value such as Fit, FitH, FitB, XYZ, followed by some parameters that mark the position.
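As a rough illustration of the point above, the following sketch walks a bookmark list shaped like the one SimpleBookmark.getBookmark() returns and reads the leading page number out of each "Page" entry. It is plain Java with no iText dependency. The class name BookmarkPages and the helpers pageNumber and collect are hypothetical names chosen for this example, and the code assumes the "Page" value starts with the page number followed by the destination type, as described above. Entries without a "Page" key (for instance, bookmarks that use a named destination) are reported as -1.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

final class BookmarkPages {

    /** Returns the leading page number of a value such as "9 XYZ 30.24 524.18 0", or -1 if absent. */
    static int pageNumber(Object pageValue) {
        if (pageValue == null) {
            return -1;
        }
        String[] parts = pageValue.toString().trim().split("\\s+");
        try {
            return Integer.parseInt(parts[0]);
        } catch (NumberFormatException e) {
            return -1; // not an explicit destination in the expected form
        }
    }

    /** Collects "title -> page" lines depth-first, indenting children by two spaces per level. */
    @SuppressWarnings("unchecked")
    static void collect(List<HashMap<String, Object>> bookmarks, int depth, List<String> out) {
        if (bookmarks == null) {
            return;
        }
        for (HashMap<String, Object> bm : bookmarks) {
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < depth; i++) {
                line.append("  ");
            }
            line.append(bm.get("Title")).append(" -> ").append(pageNumber(bm.get("Page")));
            out.add(line.toString());
            // "Kids" holds the child bookmarks, in the same shape as the top-level list.
            collect((List<HashMap<String, Object>>) bm.get("Kids"), depth + 1, out);
        }
    }
}
```

With iText on the classpath, the list passed to collect would simply be the result of SimpleBookmark.getBookmark(reader).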
If you look at the CreateOutlineTree example, you'll see that you can also extract the bookmarks as an XML file:
public void createXml(String src, String dest) throws IOException {
    PdfReader reader = new PdfReader(src);
    List<HashMap<String, Object>> list = SimpleBookmark.getBookmark(reader);
    SimpleBookmark.exportToXML(list,
        new FileOutputStream(dest), "ISO8859-1", true);
    reader.close();
}
This is a screenshot from a book I wrote about iText that shows you which keys you can expect in a bookmark entry:
As you can tell from this table, a link can also be expressed as a named destination. In that case, you won't get the page number, but a name. To get the page number, you need to extract the list of named destinations. This list will get you the explicit destination corresponding with the named destination.
That is also explained in the book, as well as in the official documentation.
Once you have the titles and the page numbers (retrieved with code written based on the above pointers), you can insert pages to the PDF file using PdfStamper and the insertPage() method. You can put the TOC on those pages using ColumnText, or you can create a separate PDF for the TOC and merge it with the original one. See How to add a cover/PDF in a existing iText document to find out more about these two techniques.
You will also benefit from this example: Create Index File(TOC) for merged pdf using itext library in java
As for the dashed line between the title and the page number, that's done using a separator, more specifically a dotted line separator. You should read this question first: iTextSharp - Is it possible to set a different alignment in the same cell for text
Then read this question: How to Generate Table-of-Figures Dot Leaders in a PdfPCell for the Last Line of Wrapped Text (or this question It is possible with itext 5 which at the end of a paragraph justified the remaining space is filled with scripts?)
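To make the idea of a dot leader concrete, here is a small plain-text sketch. It is not the iText separator mechanism that the linked questions describe: it only pads a title with dots up to a fixed character width, which is a reasonable stand-in when the output is monospaced text. The class name DotLeader and the method line are hypothetical.

```java
final class DotLeader {

    /**
     * Builds "title ..... page" so that the whole line is `width` characters long.
     * If the title is too long for that width, at least three dots are still inserted.
     */
    static String line(String title, int page, int width) {
        String number = Integer.toString(page);
        // Two single spaces separate the dots from the title and from the page number.
        int dots = Math.max(width - title.length() - number.length() - 2, 3);
        StringBuilder sb = new StringBuilder(title).append(' ');
        for (int i = 0; i < dots; i++) {
            sb.append('.');
        }
        return sb.append(' ').append(number).toString();
    }
}
```

In a real PDF the fonts are usually proportional, so the width has to come from font metrics rather than character counts; that is the part the separator classes in the linked answers take care of.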
Note that your question is actually off-topic. It's phrased as a "home work" question. It invites people to do your work in your place. Now that you have all the elements you need, you should be able to do the work yourself. If you don't succeed, you should write an on topic Stack Overflow question. That's a question in which you show what you've tried and explain the technical problem you experience.
Update:
You shared a document with the following outline tree:
As you can see, the bookmarks are defined using named destinations, such as /__WKANCHOR_2, /__WKANCHOR_4, and so on. As you can tell from the / character, the names are stored as PDF name objects (PDF 1.1), not as PDF string objects (since PDF 1.2). The most recent PDF standards recommend using PDF string objects instead of PDF name objects, so you may want to ask the vendor of your PDF generation software to update the software to meet those recommendations.
Nevertheless, we can easily get the explicit destinations that correspond with those named destinations. They are stored in the /Dests entry of the root dictionary:
When you look at the way the destinations are defined, you see another problem that should be reported to wkhtmltopdf. Let's take a look at what the ISO standard tells us about the syntax to be used for destinations:
The concept of page numbers doesn't exist in PDF. Pages are described using page dictionaries, and the page number is derived from the position of the page in the page tree. The first page that is encountered in the page tree is page 1, the second page that is encountered is page 2, and so on.
In your example, the explicit destinations are defined like this: [9/XYZ 30.2400000 524.179999 0], [9/XYZ 30.2400000 231.379999 0], and so on.
This is wrong. The ISO standard says that the first value in the array needs to be an indirect reference. An indirect reference has the format 9 0 R, not 9. I looked at the structure of the document, and I see that wkhtmltopdf uses the page number minus 1 instead of an indirect reference. If I look at /__WKANCHOR_2, it refers to [0/XYZ 30.240000 781.459999 0], and that 0 is supposed to point to page 1. As Adobe Reader tolerates crappy software, this works in Adobe Reader, but as the file is in violation of ISO-32000, iText doesn't know what to do with those misleading destinations; at least, the convenience class SimpleNamedDestination doesn't know what to do with them.
Fortunately, iText is a very versatile library that allows you to go deep under the hood of a PDF. In this case, we only have to go one level deeper. Instead of SimpleNamedDestination.getNamedDestination(reader, true), we can use the following approach:
HashMap<String, PdfObject> names = reader.getNamedDestinationFromNames();
for (Map.Entry<String, PdfObject> entry: names.entrySet()) {
    System.out.print(entry.getKey());
    System.out.print(": p");
    PdfArray arr = (PdfArray)entry.getValue();
    System.out.println(arr.getAsNumber(0).intValue() + 1);
}
reader.close();
The output of this method is:
__WKANCHOR_w: p7
__WKANCHOR_y: p7
__WKANCHOR_2: p1
__WKANCHOR_4: p1
__WKANCHOR_16: p9
__WKANCHOR_14: p8
__WKANCHOR_18: p9
__WKANCHOR_1s: p13
__WKANCHOR_a: p2
__WKANCHOR_1q: p13
__WKANCHOR_1o: p12
__WKANCHOR_12: p8
__WKANCHOR_1m: p12
__WKANCHOR_e: p3
__WKANCHOR_10: p7
__WKANCHOR_1k: p12
__WKANCHOR_c: p3
__WKANCHOR_1i: p11
__WKANCHOR_i: p4
__WKANCHOR_8: p2
__WKANCHOR_g: p3
__WKANCHOR_1g: p11
__WKANCHOR_6: p1
__WKANCHOR_1e: p10
__WKANCHOR_m: p5
__WKANCHOR_1c: p10
__WKANCHOR_k: p4
__WKANCHOR_q: p5
__WKANCHOR_1a: p9
__WKANCHOR_o: p5
__WKANCHOR_u: p6
__WKANCHOR_s: p6
If we check __WKANCHOR_2, we see that it correctly points at page 1. I checked the final link in the outlines, it points at the named destination with name __WKANCHOR_1s and indeed: that should link to page 13.
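Once the names have been resolved, building the table of contents comes down to a lookup. The sketch below is plain Java without iText. It assumes you have already copied the zero-based page index of every named destination into a map (for example inside the loop shown earlier) and that each bookmark carries its destination name under the "Named" key, as in the output quoted in the question. NamedDestinationToc and build are hypothetical names, and only top-level bookmarks are handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class NamedDestinationToc {

    /**
     * Returns one "title<TAB>page" line per bookmark.
     * pageIndexByName maps a destination name to the zero-based index stored in the
     * (non-conforming) destination array, so 1 is added to obtain the page number.
     * Bookmarks whose name cannot be resolved get "?" as page.
     */
    static List<String> build(List<? extends Map<String, Object>> bookmarks,
                              Map<String, Integer> pageIndexByName) {
        List<String> lines = new ArrayList<>();
        for (Map<String, Object> bm : bookmarks) {
            Object name = bm.get("Named");
            Integer index = (name == null) ? null : pageIndexByName.get(name.toString());
            String page = (index == null) ? "?" : String.valueOf(index + 1);
            lines.add(bm.get("Title") + "\t" + page);
        }
        return lines;
    }
}
```

For the document discussed here, an entry named __WKANCHOR_2 with stored index 0 ends up on page 1, and __WKANCHOR_1s with stored index 12 ends up on page 13, which matches the output listed above.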
Your problem is a clear example of a "garbage in, garbage out" problem. Your tool produces PDFs that are in violation of the ISO standard for PDF, and as a result you lose plenty of time trying to figure out what's wrong. But what's even worse: you made me lose time because of someone else's fault.

Java PDFBox Multi-Page Document With Headings and Table of Contents

The PDFBox v2.0 is still growing and doesn't have any nice and easy examples to get started with.
I need to create a multi-page PDF dynamically from an object; with TableOfContents and Headings!
How do I create Numbered Headings? (Increasing the font size is not an option because the TableOfContents has to know their location[page number] in the document)
Example:
Table of Contents
Document Title Here
1- Intro......................................................1
2- Heading....................................................2
2.1- SubHeading1 ........................................2
2.2- SubHeading2 ........................................5
Page 1
Document Title Here
1- Intro
This is an intro to the document....
2- Heading
2.1- Subheading
Some text here...
........
Page 2
I have two problems:
I followed this example here: PDFBox - how to create table of contents but it didn't create any TableOfContents.
I'm getting this exception:
java.lang.IllegalArgumentException: Destination of a GoTo action must be a page dictionary object
at org.apache.pdfbox.pdmodel.interactive.action.PDActionGoTo.setDestination(PDActionGoTo.java:90)
I removed the PDActionGoTo as the answer comment said; it didn't give any exception but it didn't create any TableOfContents
I have no idea how to make headings!

How do I get Custom Format Script of Pdf form fields using iText?

I have a PDF form and I am writing data to its fields through code using iText and Java. I want to get the Custom Format Script of the fields so that I know which inputs are valid for them. Thanks.

Update:
OOPS,
I misinterpreted your question. I assumed you wanted to add JavaScript to an existing PDF so that the user can add valid data.
What you are looking for is a way to extract the JavaScript from an existing PDF. You already get an impression on how to do that in the answer I referred to in my previous answer. For JavaScript added to specific fields, you need to inspect the Additional Actions:
AcroFields form = stamper.getAcroFields();
AcroFields.Item fd = form.getFieldItem("married");
// Get the PDF dictionary of the YES radio button and add an additional action
PdfDictionary dictYes =
(PdfDictionary) PdfReader.getPdfObject(fd.getWidgetRef(0));
PdfDictionary yesAction = dictYes.getAsDict(PdfName.AA);
if (yesAction == null) yesAction = new PdfDictionary();
yesAction.put(new PdfName("Fo"),
PdfAction.javaScript("setReadOnly(false);", stamper.getWriter()));
dictYes.put(PdfName.AA, yesAction);
// Get the PDF dictionary of the NO radio button and add an additional action
PdfDictionary dictNo =
(PdfDictionary) PdfReader.getPdfObject(fd.getWidgetRef(1));
PdfDictionary noAction = dictNo.getAsDict(PdfName.AA);
If noAction isn't null, you'll need to examine the different values in that dictionary. For instance: the /Bl entry (if present) will give you the action that is triggered on blur, the /Fo entry (if present) will give you the action that is triggered on focus, and so on.
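As a quick reference for which entry to look at, here is a small lookup sketch in plain Java. The keys and their meanings follow the additional-actions tables in ISO 32000-1 as I understand them; in particular, a field's custom format script would normally sit under the /F key. The class name FieldActionKeys and the method describe are hypothetical, and the list is not exhaustive.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class FieldActionKeys {

    /** Selected keys of an /AA (additional actions) dictionary and the event that triggers them. */
    static final Map<String, String> TRIGGERS = new LinkedHashMap<>();

    static {
        TRIGGERS.put("Fo", "the field receives the input focus");
        TRIGGERS.put("Bl", "the field loses the input focus (blur)");
        TRIGGERS.put("K", "keystroke: the user changes the field's content");
        TRIGGERS.put("F", "format: run before the field value is formatted for display");
        TRIGGERS.put("V", "validate: run when the field value is changed");
        TRIGGERS.put("C", "calculate: run to recalculate the value when another field changes");
    }

    /** Accepts a key with or without the leading slash, e.g. "/F" or "F". */
    static String describe(String key) {
        String bare = key.startsWith("/") ? key.substring(1) : key;
        String meaning = TRIGGERS.get(bare);
        return meaning == null ? "unknown trigger" : meaning;
    }
}
```

With iText, each of these keys would be looked up in the dictionary obtained from getAsDict(PdfName.AA), and the script text is then found in the /JS entry of the action dictionary stored under that key.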
If you want to get the document-level JavaScript, you need to fetch the appropriate entry from the name tree in the Catalog of the PDF document. It is hard to explain in words, but if you download RUPS and inspect the PDF, you should be able to find the different JavaScript snippets. If we look at the file created using my incorrect answer (see below), we get this:
This shows that you need something like this:
PdfDictionary catalog = reader.getCatalog();
PdfDictionary names = catalog.getAsDict(PdfName.NAMES);
PdfDictionary javascript = names.getAsDict(PdfName.JAVASCRIPT);
Once you have this javascript dictionary, you can extract all the Javascript snippets.
Incorrect answer:
I assume that you know how to write JavaScript (or more correctly ECMAScript). JavaScript in PDF is very similar to JavaScript in HTML. I assume you don't need help to write some JavaScript methods to check if input is valid.
If that is the case, you only need to know how to add the JavaScript to an existing PDF file. For instance: I have this PDF file named form_without_js.pdf to which I want to add some javascript, for instance extra.js. In extra.js, you'll find a method that sets a field to read-only as well as a method that validates a field: if the value of married is yes and there is no value for partner, an alert is shown, otherwise the form is submitted. You'll have to write similar JavaScript depending on the nature of your form and which fields you want to check.
The AddJavaScriptToForm example shows you how to add these JavaScript methods to the existing PDF, resulting in the file form_with_js.pdf.
This is done with PdfReader and PdfStamper:
PdfReader reader = new PdfReader(src);
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
// do stuff
stamper.close();
reader.close();
Where it says // do stuff, you need to do two things:
[1.] You need to add the JavaScript snippet like this:
PdfWriter writer = stamper.getWriter();
writer.addJavaScript(Utilities.readFileToString(RESOURCE));
[2.] You need to add some JavaScript to specific fields to call the custom methods you've added.
In the example, you see a case where we add JavaScript as an additional action. You also see how we add a new button with a specific action. It's up to you to decide what is needed in your specific case.

Accessing "alternate text" for an image via PDFBox

Is there some way to extract "alternate text" for a specific image using PDFBox?
I have a PDF file which, as described at http://www.w3.org/WAI/GL/2011/WD-WCAG20-TECHS-20110621/pdf.html#PDF1, has had alternate text added to an image. Using PDFBox I can find my way through the object model to the image itself (a PDXObjectImage) through PDFDocument.getDocumentCatalog().getAllPages() [iterator] .getResources.getImages() but I can not see any way to get from the image itself to the alternate text for it.
A small sample PDF (with a single image which has some alternate text specified) can be found at http://dl.dropbox.com/u/12253279/image_test_pass.pdf (It should say "This is the alternate text for the image.").

I do not know how (or if) this can be done with PDFBox, but I can tell you that this feature is related to the sections of the PDF specification called Logical Structure and Tagged PDF, which are not fully supported by every PDF tool out there.
Assuming it is supported by the tool you are using, you will have to follow 4 main steps to retrieve this information (I will use the sample PDF file you posted for the following explanation).
Assuming you have access to the internal structure of the PDF file, you will need to:
1- Parse the page content and find the MCID number of the Tag element that wraps the image you are interested in.
Page content:
BT
/P <</MCID 0 >>BDC
/GS0 gs
/TT0 1 Tf
0.0004 Tc -0.0028 Tw 10.02 0 0 10.02 90 711 Tm
(This is an image test )Tj
EMC
ET
/Figure <</MCID 1 >>BDC
q
106.5 0 0 106.5 90 591.0599976 cm
/Im0 Do
Q
EMC
Your image:
2- In the page object, retrieve the key StructParents.
3- Now retrieve the Structure Tree (key StructTreeRoot of the Catalog object, which is the root object in every PDF file), and inside it, the ParentTree.
4- The ParentTree starts with an array where you can find pairs of elements (see Number Trees in the PDF specification for more details). In this specific tree, the first element of each pair is a numeric value that corresponds to the StructParents key retrieved in step 2, and the second element is an array of objects whose indexes correspond to the MCID values retrieved in step 1. So you search here for the element that corresponds to the MCID value of your image, and you will find a PDF object. Inside this object, you will find the alternate text.
Looks easy, doesn't it?
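Step 1 can be sketched in plain Java for simple content streams like the one shown above. This is only an illustration: it assumes the decoded page content is already available as a string, it uses a regular expression instead of a real content-stream parser, and it would be confused by operator-like words inside text strings or by inline property lists that differ from the form shown. The class name McidFinder and the method mcidForImage are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class McidFinder {

    // Alternatives: a BDC with an inline MCID, any other BDC/BMC, an EMC, or "/Name Do".
    private static final Pattern TOKEN = Pattern.compile(
            "/MCID\\s+(\\d+)[^>]*>>\\s*BDC|\\b(BDC|BMC)\\b|\\bEMC\\b|/(\\w+)\\s+Do\\b");

    /** Returns the MCID of the innermost marked-content sequence that paints the image, or -1. */
    static int mcidForImage(String content, String imageName) {
        Deque<Integer> open = new ArrayDeque<>(); // innermost sequence first
        Matcher m = TOKEN.matcher(content);
        while (m.find()) {
            if (m.group(1) != null) {
                open.push(Integer.parseInt(m.group(1)));   // marked content with an MCID
            } else if (m.group(2) != null) {
                open.push(-1);                             // marked content without an MCID
            } else if (m.group(3) != null) {
                if (m.group(3).equals(imageName)) {
                    for (int id : open) {
                        if (id >= 0) {
                            return id;
                        }
                    }
                    return -1;
                }
            } else if (!open.isEmpty()) {
                open.pop();                                // EMC closes the innermost sequence
            }
        }
        return -1;
    }
}
```

For the page content quoted above, looking up Im0 yields MCID 1, which is the index to use in the array found in step 4.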
Tools used in this answer:
PDF Vole (based on iText)
Amyuni PDF Analyzer

Eric from the PDFBox mailing list sent me the following, though I've not tested it out yet...
Hi,
For your test file, here is a way to access "/Alt" entry :
PDDocument document = PDDocument.load("image_test_pass.pdf");
PDStructureTreeRoot treeRoot =
    document.getDocumentCatalog().getStructureTreeRoot();
// get page for each StructElement
for (Object o : treeRoot.getKids()) {
    if (o instanceof PDStructureElement) {
        PDStructureElement structElement = (PDStructureElement)o;
        System.out.println(structElement.getAlternateDescription());
        PDPage page = structElement.getPage();
        if (page != null) {
            page.getResources().getImages();
        }
    }
}
Please refer to the PDF specification http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf, in particular §14.6, §14.7, §14.9.3 and §14.9.4, for all the rules on how to find the "/Alt" entry. There seem to be several ways to define this information.
BR,
Eric
