Apache PDFBox 2.0.18 - Comments/Annotations status linking

I need to merge the comments from several versions of the same PDF file, each with different comments, into one PDF file containing all of them.
I collect all the comments from the pages into an ArrayList, then simply set that list of comments on the new PDF file, and it works pretty well.
The problem is that I also need to create an Excel file listing all the comments found together with their "status" (accepted, cancelled, rejected, etc.).
PDFBox seems to treat the status as a separate annotation/comment, and I can't find any relation between a comment and its status.
Example:
I have a PDAnnotation object with content "COMMENT 1".
And I have another PDAnnotation object with content "Accepted by user XX" (the status of COMMENT 1).
In Acrobat Reader I see the comment "COMMENT 1" with its status set to "Accepted", so there must be a relation between the two objects, but I can't find it.
Any ideas?

Using the PDFDebugger is a good suggestion; it should give you an overview of how objects (including PDAnnotations) are linked to each other.
Anyway, check whether the COSDictionary of your child PDAnnotation has an IRT ("in reply to") key. The value of that key should be the parent annotation's object.
So if you do something like this:
COSDictionary parentDict = (COSDictionary) childDict.getDictionaryObject("IRT");
you should get the parent PDAnnotation dictionary, and from it you can take all the data you need.
Please note that the cast is necessary, since getDictionaryObject returns a COSBase, while the object returned for the IRT key is actually a COSDictionary.
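To turn that lookup into the comment/status pairs needed for the Excel export, one option is to group every annotation that carries an IRT entry under the annotation it points to. The sketch below is library-free: the Annot class and its inReplyTo field are hypothetical stand-ins for a PDAnnotation and its IRT entry, so only the join logic is shown, not real PDFBox calls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StatusLinker {

    // Hypothetical stand-in for a PDAnnotation: its text plus the annotation
    // referenced by its IRT entry (null for a top-level comment).
    public static final class Annot {
        public final String contents;
        public final Annot inReplyTo;

        public Annot(String contents, Annot inReplyTo) {
            this.contents = contents;
            this.inReplyTo = inReplyTo;
        }
    }

    // Groups reply/status annotations under the comment they reply to.
    public static Map<Annot, List<Annot>> groupReplies(List<Annot> all) {
        Map<Annot, List<Annot>> repliesByParent = new LinkedHashMap<>();
        for (Annot a : all) {
            if (a.inReplyTo != null) {
                repliesByParent.computeIfAbsent(a.inReplyTo, k -> new ArrayList<>()).add(a);
            }
        }
        return repliesByParent;
    }

    public static void main(String[] args) {
        Annot comment = new Annot("COMMENT 1", null);
        Annot status = new Annot("Accepted by user XX", comment);
        Map<Annot, List<Annot>> replies = groupReplies(Arrays.asList(comment, status));
        for (Annot reply : replies.getOrDefault(comment, Collections.<Annot>emptyList())) {
            System.out.println(comment.contents + " -> " + reply.contents);
        }
    }
}
```

With real annotations, the parent would come from the IRT lookup shown above instead of the inReplyTo field.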

Related

Replacing text in XWPFParagraph without changing format of the docx file

I am developing a font converter app which converts Unicode font text to Krutidev/Shree Lipi (Marathi/Hindi) font text. The original docx file contains formatted words (i.e. color, font, text size, hyperlinks, etc.).
I want to keep the format of the final docx the same as the original docx after converting the words from Unicode to another font.
PFA.
Here is my Code
try {
    fileInputStream = new FileInputStream("StartDoc.docx");
    document = new XWPFDocument(fileInputStream);
    XWPFWordExtractor extractor = new XWPFWordExtractor(document);
    List<XWPFParagraph> paragraph = document.getParagraphs();
    Converter data = new Converter();
    for (XWPFParagraph p : document.getParagraphs()) {
        for (XWPFRun r : p.getRuns()) {
            String string2 = r.getText(0);
            data.uniToShree(string2);
            r.setText(string2, 0);
        }
    }
    // Write the document to the file system
    FileOutputStream out = new FileOutputStream(new File("Output.docx"));
    document.write(out);
    out.close();
    System.out.println("Output.docx written successfully");
}
catch (IOException e) {
    System.out.println("We had an error while reading the Word Doc");
}

Thank you for asking.
I worked with POI some years ago, though on Excel workbooks rather than Word documents, but I’ll still try to help you reach the root cause of your error.
The exception that Java throws already carries good debugging information!
A good first step to narrow down the error is to stop replacing the exception's own message with a fixed string of yours.
Try printing the result of e.getLocalizedMessage() or e.getMessage() and see what you get.
Printing the stack trace with the printStackTrace method is also often useful for pinpointing where your error lies!
Share your findings from the above method calls so that we can help you debug the issue further.
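As a minimal sketch of that advice, the catch block can report what the exception itself says instead of a fixed string. The class name ErrorReport, the helper tryRead and the file name are made up for illustration; only standard java.io calls are used.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ErrorReport {

    // Opens the file and reads one byte. Returns null on success, otherwise
    // the exception type and its own message rather than a hard-coded text.
    public static String tryRead(String path) {
        try (InputStream in = new FileInputStream(path)) {
            in.read();
            return null;
        } catch (IOException e) {
            e.printStackTrace(); // shows where the failure happened
            return e.getClass().getSimpleName() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Placeholder path that is assumed not to exist, to trigger a failure.
        System.out.println(tryRead("no-such-file.docx"));
    }
}
```

The exact message text depends on the platform, which is one more reason to print it rather than guess at it.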
[EDIT 1:]
So it seems you are able to process the file correctly as far as the font conversion of the data goes, but you are not able to reproduce the formatting of the original data in the converted file.
(thus, "We had an error while reading the Word Doc" is a lie getting printed ;) )
Now, there are 2 elements to a Word document:
Content
Structure or Schema
You are able to convert the data because you are working only on the content of your doc files.
To retain the formatting of the contents, your solution needs to be aware of the formatting of the doc files as well and take care of it.
MS Word, which defines the document format and its extension (.docx), follows a particular set of schemas that define the formatting rules. These schemas are defined in Microsoft's XML namespaces [1].
You can obtain the XML (HTML) form of the doc file quite easily (see the steps in [1] or the code in [2]) and even apply different schemas, or possibly your own schema definitions based on those provided by MS's namespaces. You can do this either programmatically, for which you need to get familiar with XML, XSL and XSLT concepts (w3schools [3] is a good starting point), although this route is no less complex than writing your own version of MS Word; or by using MS Word's built-in tools as shown in [1].
[1]. https://www.microsoftpressstore.com/articles/article.aspx?p=2231769&seqNum=4#:~:text=During%20conversion%2C%20Word%20tags%20the,you%20can%20an%20HTML%20file.
[2]. https://svn.apache.org/repos/asf/poi/trunk/src/scratchpad/testcases/org/apache/poi/hwpf/converter/TestWordToHtmlConverter.java
[3]. https://www.w3schools.com/xml/
My answer gives you only a cursory overview of how to achieve what you want; depending on your inclination and available time, use your discretion before committing to one path over the other.
Hope it helps!

Java PDFBox Multi-Page Document With Headings and Table of Contents

PDFBox v2.0 is still growing and doesn't have any nice and easy examples to get started with.
I need to create a multi-page PDF dynamically from an object, with a table of contents and headings!
How do I create numbered headings? (Just increasing the font size is not an option, because the table of contents has to know each heading's location [page number] in the document.)
Example:
Table of Contents
Document Title Here
1- Intro......................................................1
2- Heading....................................................2
2.1- SubHeading1 ........................................2
2.2- SubHeading2 ........................................5
Page 1
Document Title Here
1- Intro
This is an intro to the document....
2- Heading
2.1- Subheading
Some text here...
........
Page 2
I have two problems:
1- I followed this example: PDFBox - how to create table of contents, but it didn't create any table of contents.
I'm getting this exception:
java.lang.IllegalArgumentException: Destination of a GoTo action must be a page dictionary object
at org.apache.pdfbox.pdmodel.interactive.action.PDActionGoTo.setDestination(PDActionGoTo.java:90)
I removed the PDActionGoTo as the comment on the answer suggested; the exception went away, but still no table of contents was created.
2- I have no idea how to make headings!

MigraDOC and pdfSharp, center dynamic text

I was looking for a solution to create a document (PDF) in such a way that I can calculate the position of an input field.
The PDF content should look something like this
(all PDF content needs to be centered):
Header
Field1 : not dynamic (needs to be centered)
Field2 : UserName (dynamic, needs to be placed in the center of the paragraph), since each user has a different name length
So my question is: does PDFsharp or MigraDoc have a method or something similar that can align the text to the center (meaning that it does the calculation from the font family and font size so that in the end the marked text is centered)?
If so, what is the method? I've searched the MigraDoc and PDFsharp documentation and could not find anything like that.
And if no such method exists, has anyone tried this or worked with it? Are there any suggestions on how I can achieve this behavior, or maybe some source to look at?
Thank you

Sample 1 shows, with a lot of example code, how to create a PDF and use almost all the functionality that MigraDoc offers. Sample 2 shows in detail how to create tables, which might also be of interest to you for laying out the page's content.
The alignment (center/left/right) is usually set through the Format.Alignment property, like this:
Paragraph par = new Paragraph();
par.Format.Alignment = ParagraphAlignment.Center;
A short version of a document with centered content would be:
// first you need a document
Document MigraDokument = new Document();
// each document needs at least one section
Section section = MigraDokument.AddSection();
section.PageSetup.PageFormat = PageFormat.A4;
// and then you add paragraphs to the section
Paragraph par = section.AddParagraph();
// and set the alignment as you wish
par.Format.Alignment = ParagraphAlignment.Center;
// now just fill it with content and set the rest of the parameters...
par.AddText("text");

Merging two .odt files from code

How do you merge two .odt files? Doing it by hand, opening each file and copying the content, would work but is infeasible.
I have tried the ODF Toolkit Simple API (simple-odf-0.8.1-incubating) for this task, creating an empty TextDocument and merging everything into it:
private File masterFile = new File(...);
...
TextDocument t = TextDocument.newTextDocument();
t.save(masterFile);
...
for (File f : filesToMerge) {
    joinOdt(f);
}
...
void joinOdt(File joinee) {
    TextDocument master = (TextDocument) TextDocument.loadDocument(masterFile);
    TextDocument slave = (TextDocument) TextDocument.loadDocument(joinee);
    master.insertContentFromDocumentAfter(slave, master.getParagraphByReverseIndex(0, false), true);
    master.save(masterFile);
}
And that works reasonably well, however it loses information about fonts: the original files are a combination of Arial Narrow and Wingdings (for check boxes), while the output masterFile is all in Times New Roman. At first I suspected the last parameter of insertContentFromDocumentAfter, but changing it to false breaks (almost) all formatting. Am I doing something wrong? Is there any other way?

I think this is "works as designed".
I tried this once with a global document, which imports documents and displays them as is... as long as the paragraph styles have different names!
Styles with the same name are overwritten with the values the "master" document has.
So I ended up cloning the standard styles under unique (per document) names.
HTH
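The renaming idea can be pictured with plain maps. In the sketch below a style sheet is modelled as a hypothetical map from style name to font name, and docId is an arbitrary per-document prefix. It does not call the ODF Toolkit at all; it only illustrates why unique names keep the master from overwriting imported styles.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StyleRenamer {

    // Copies every style of one document into the master under a name made
    // unique by the document id, so same-named styles no longer collide.
    public static void importStyles(Map<String, String> master, String docId,
                                    Map<String, String> styles) {
        for (Map.Entry<String, String> e : styles.entrySet()) {
            master.put(docId + "_" + e.getKey(), e.getValue());
        }
    }

    public static void main(String[] args) {
        Map<String, String> master = new LinkedHashMap<>();
        master.put("Standard", "Times New Roman");

        Map<String, String> joinee = new LinkedHashMap<>();
        joinee.put("Standard", "Arial Narrow");

        // Without the prefix, the joinee's "Standard" would clash with the master's.
        importStyles(master, "doc1", joinee);
        System.out.println(master);
    }
}
```

In a real document, the paragraphs of the imported content would also have to be pointed at the renamed styles.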

My case was a rather simple one: the files I wanted to merge were generated the same way and used the same basic formatting. Therefore, starting from one of my files instead of an empty document fixed my problem.
However, this question will remain open until someone comes up with a more general solution to formatting retention (possibly based on ngulam's answer and comments?).

Accessing "alternate text" for an image via PDFBox

Is there some way to extract "alternate text" for a specific image using PDFBox?
I have a PDF file which, as described at http://www.w3.org/WAI/GL/2011/WD-WCAG20-TECHS-20110621/pdf.html#PDF1, has had alternate text added to an image. Using PDFBox I can find my way through the object model to the image itself (a PDXObjectImage) through PDDocument.getDocumentCatalog().getAllPages() [iterator] .getResources().getImages(), but I cannot see any way to get from the image itself to its alternate text.
A small sample PDF (with a single image which has some alternate text specified) can be found at http://dl.dropbox.com/u/12253279/image_test_pass.pdf (It should say "This is the alternate text for the image.").

I do not know how, or whether, this can be done with PDFBox, but I can tell you that this feature belongs to the sections of the PDF Spec called Logical Structure/Tagged PDF, which are not fully supported by every PDF tool out there.
Assuming it is supported by the tool you are using, you will have to follow 4 main steps to retrieve this information (I will use the sample PDF file you posted for the following explanation).
Assuming you have access to the internal structure of the PDF file, you will need to:
1- Parse the page content and find the MCID number of the Tag element that wraps the image you are interested in.
Page content:
BT
/P <</MCID 0 >>BDC
/GS0 gs
/TT0 1 Tf
0.0004 Tc -0.0028 Tw 10.02 0 0 10.02 90 711 Tm
(This is an image test )Tj
EMC
ET
/Figure <</MCID 1 >>BDC
q
106.5 0 0 106.5 90 591.0599976 cm
/Im0 Do
Q
EMC
Your image:
2- In the page object, retrieve the key StructParents.
3- Now retrieve the Structure Tree (key StructTreeRoot of the Catalog object, which is the root object in every PDF file), and inside it, the ParentTree.
4- The ParentTree starts with an array where you can find pairs of elements (see Number Trees in the PDF Spec for more details). In this specific tree, the first element of each pair is a numeric value that corresponds to the StructParents key retrieved in step 2, and the second element is an array of objects whose indexes correspond to the MCID values retrieved in step 1. So you search here for the element that corresponds to the MCID value of your image, and you will find a PDF object. Inside this object, you will find the alternate text.
Looks easy, doesn't it?
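Steps 2 to 4 amount to a two-level lookup, which can be sketched without any PDF library. In this sketch the ParentTree is modelled as a hypothetical map from a page's StructParents value to a list indexed by MCID, StructElem is a made-up stand-in for a structure element, and the StructParents value 0 is an assumption for the sample file, not something read from it.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AltTextLookup {

    // Hypothetical stand-in for a structure element: tag type and /Alt text.
    public static final class StructElem {
        public final String type;
        public final String alt;

        public StructElem(String type, String alt) {
            this.type = type;
            this.alt = alt;
        }
    }

    // parentTree maps a page's StructParents value to its structure elements,
    // where the list index is the MCID found in the page content.
    public static String altTextFor(Map<Integer, List<StructElem>> parentTree,
                                    int structParents, int mcid) {
        List<StructElem> byMcid = parentTree.get(structParents);
        if (byMcid == null || mcid < 0 || mcid >= byMcid.size()) {
            return null; // no entry for this page or this MCID
        }
        StructElem elem = byMcid.get(mcid);
        return elem == null ? null : elem.alt;
    }

    public static void main(String[] args) {
        Map<Integer, List<StructElem>> parentTree = new HashMap<>();
        // MCID 0 is the /P text and MCID 1 the /Figure, as in the page content above.
        parentTree.put(0, Arrays.asList(
                new StructElem("P", null),
                new StructElem("Figure", "This is the alternate text for the image.")));

        System.out.println(altTextFor(parentTree, 0, 1));
    }
}
```

A real implementation would first have to parse the content stream for the MCID (step 1) and walk the actual number tree, which may be split across several nodes.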
Tools used in this answer:
PDF Vole (based on iText)
Amyuni PDF Analyzer

Eric from the PDFBox mailing list sent me the following, though I've not tested it yet:
Hi,
For your test file, here is a way to access the "/Alt" entry:
PDDocument document = PDDocument.load("image_test_pass.pdf");
PDStructureTreeRoot treeRoot =
        document.getDocumentCatalog().getStructureTreeRoot();
// get page for each StructElement
for (Object o : treeRoot.getKids()) {
    if (o instanceof PDStructureElement) {
        PDStructureElement structElement = (PDStructureElement) o;
        System.out.println(structElement.getAlternateDescription());
        PDPage page = structElement.getPage();
        if (page != null) {
            page.getResources().getImages();
        }
    }
}
Please refer to the PDF specification http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf and in particular §14.6, §14.7, §14.9.3 and §14.9.4 to know all the rules for finding the "/Alt" entry. There seem to be several ways to define this information.
BR,
Eric
