Replacing text in XWPFParagraph without changing format of the docx file - java

I am developing font converter app which will convert Unicode font text to Krutidev/Shree Lipi (Marathi/Hindi) font text. In the original docx file there are formatted words (i.e. Color, Font, size of the text, Hyperlinks..etc. ).
I want to keep format of the final docx same as the original docx after converting words from Unicode to another font.
PFA.
Here is my Code
try {
fileInputStream = new FileInputStream("StartDoc.docx");
document = new XWPFDocument(fileInputStream);
XWPFWordExtractor extractor = new XWPFWordExtractor(document);
List<XWPFParagraph> paragraph = document.getParagraphs();
Converter data = new Converter() ;
for(XWPFParagraph p :document.getParagraphs())
{
for(XWPFRun r :p.getRuns())
{
String string2 = r.getText(0);
data.uniToShree(string2);
r.setText(string2,0);
}
}
//Write the Document in file system
FileOutputStream out = new FileOutputStream(new File("Output.docx");
document.write(out);
out.close();
System.out.println("Output.docx written successully");
}
catch (IOException e) {
System.out.println("We had an error while reading the Word Doc");
}

Thank you for ask-an-answer.
I have worked using POI some years ago, but over excel-workbooks, but still I’ll try to help you reach the root cause of your error.
The Java compiler is smart enough to suggest good debugging information in itself!
A good first step to disambiguate the error is to not overwrite the exception message provided to you via the compiler complain.
Try printing the results of e.getLocalizedMessage()or e.getMessage() and see what you get.
Getting the stack trace using printStackTrace method is also useful oftentimes to pinpoint where your error lies!
Share your findings from the above method calls to further help you help debug the issue.
[EDIT 1:]
So it seems, you are able to process the file just right with respect to the font conversion of the data, but you are not able to reconstruct the formatting of the original data in the converted data file.
(thus, "We had an error while reading the Word Doc", is a lie getting printed ;) )
Now, there are 2 elements to a Word document:
Content
Structure or Schema
You are able to convert the data as you are working only on the content of your respective doc files.
In order to be able to retain the formatting of the contents, your solution needs to be aware of the formatting of the doc files as well and take care of that.
MS Word which defined the doc files and their extension (.docx) follows a particular set of schemas that define the rules of formatting. These schemas are defined in Microsoft's XML Namespace packages[1].
You can obtain the XML(HTML) format of the doc-file you want quite easily (see steps in [1] or code in link [2]) and even apply different schemas or possibly your own schema definitions based on the definitions provided by MS's namespaces, either programmatically, for which you need to get versed with XML, XSL and XSLT concepts (w3schools[3] is a good starting point) but this method is no less complex than writing your own version of MS-Word; or using MS-Word's inbuilt tools as shown in [1].
[1]. https://www.microsoftpressstore.com/articles/article.aspx?p=2231769&seqNum=4#:~:text=During%20conversion%2C%20Word%20tags%20the,you%20can%20an%20HTML%20file.
[2]. https://svn.apache.org/repos/asf/poi/trunk/src/scratchpad/testcases/org/apache/poi/hwpf/converter/TestWordToHtmlConverter.java
[3]. https://www.w3schools.com/xml/
My answer provides you with a cursory overview of how to achieve what you want to, but depending on your inclination and time availability, you may want to use your discretion before you decide to head onto one path than the other.
Hope it helps!

Related

pdfbox embedding subset font for annotations

I am trying to use Apache PDFBOX v2.0.21 to modify existing PDF documents, adding signatures and annotations. That means that I am actively using incremental save mode. I am also embedding LiberationSans font to accommodate some Unicode characters. It makes sense for me to use the subsetting feature of PDF embedded fonts as embedding LiberationSans in full makes the PDF file around 200+ KB more in side.
After multiple trials and errors I finally managed to have something working - all but the font subsetting. The way I do this is to initialize the PDFont object once using
try (InputStream fs = PDFService.class.getResourceAsStream("/static/fonts/LiberationSans-Regular.ttf")) {
_font = PDType0Font.load(pddoc, fs, true);
}
And then to use custom Appearance Stream to show the text.
private void addAnnotation(String name, PDDocument doc, PDPage page, float x, float y, String text) throws IOException {
List<PDAnnotation> annotations = page.getAnnotations();
PDAnnotationRubberStamp t = new PDAnnotationRubberStamp();
t.setAnnotationName(name); // might play important role
t.setPrinted(true); // always visible
t.setReadOnly(true); // does not interact with user
t.setContents(text);
PDRectangle rect = ....;
t.setRectangle(rect);
PDAppearanceDictionary ap = new PDAppearanceDictionary();
ap.setNormalAppearance(createAppearanceStream(doc, t));
ap.getCOSObject().setNeedToBeUpdated(true);
t.setAppearance(ap);
annotations.add(t);
page.setAnnotations(annotations);
t.getCOSObject().setNeedToBeUpdated(true);
page.getResources().getCOSObject().setNeedToBeUpdated(true);
page.getCOSObject().setNeedToBeUpdated(true);
doc.getDocumentCatalog().getPages().getCOSObject().setNeedToBeUpdated(true);
doc.getDocumentCatalog().getCOSObject().setNeedToBeUpdated(true);
}
private PDAppearanceStream createAppearanceStream(final PDDocument document, PDAnnotation ann) throws IOException
{
PDAppearanceStream aps = new PDAppearanceStream(document);
PDRectangle rect = ann.getRectangle();
rect = new PDRectangle(0, 0, rect.getWidth(), rect.getHeight());
aps.setBBox(rect); // set bounding box to the dimensions of the annotation itself
// embed our unicode font (NB: yes, this needs to be done otherwise aps.getResources() == null which will cause NPE later during setFont)
PDResources res = new PDResources();
_fontName = res.add(_font).getName();
aps.setResources(res);
PDAppearanceContentStream apsContent = null;
try {
// draw directly on the XObject's content stream
apsContent = new PDAppearanceContentStream(aps);
apsContent.beginText();
apsContent.setFont(_font, _fontSize);
apsContent.showText(ann.getContents());
apsContent.endText();
}
finally {
if (apsContent != null) {
try { apsContent.close(); } catch (Exception ex) { log.error(ex.getMessage(), ex); }
}
}
aps.getResources().getCOSObject().setNeedToBeUpdated(true);
aps.getCOSObject().setNeedToBeUpdated(true);
return aps;
}
This code runs, but creates a PDF with dots instead of actual characters, which, I guess, means that the font subset has not been embedded. Moreover, I get the following warnings:
2021-04-17 12:33:31.326 WARN 20820 --- [ main]
o.a.p.pdmodel.PDAbstractContentStream : attempting to use subset
font LiberationSans without proper context
After looking through the source code, I get and I guess that I am messing something up when creating the appearance stream - somehow it's not connected with the PDDocument and the subsetting does not continue normally. Note that the above code works well when the font is embedded fully (i.e. if I call PDType0Font.load with the last parameter set to false)
Can anyone think of some hint to give to me? Thank you!
I don't know - am I lucky? It is very often that luckiness in programming points to something completely wrong or misleading. In any case, if someone can still give a hint, my ears are more than open...
Again, after looking through the code, I saw the following in PDDocument.save():
// subset designated fonts
for (PDFont font : fontsToSubset)
{
font.subset();
}
This is not happening in PDDocument.saveIncremental() which I am using. Just to mess around with the code, I went and did the following just before calling saveIncremental() on my document:
_font.subset(); // you can see in the beginning of the question how _font is created
_font.getCOSObject().setNeedToBeUpdated(true);
pddoc.saveIncremental(baos);
Believe it or not, but the document was saved correctly - at least it appears correct in Acrobat Reader DC and Chrome & Firefox PDF viewers. Note that Unicode codepoints are added to the subset for the font during showText() on appearance content stream.
UPDATE 18/04/2021: as I mentioned in the comments, I got reports from users that started seeing messages like "Cannot extract the embedded font XXXXXX+LiberationSans-Regular from ...", when they opened the modified PDF files. Strangely enough, I didn't see these messages during my tests. It turns out that my copy of Acrobat Reader DC was newer than theirs, and specifically with the continuous release version 2021.001.20149 no errors were shown, while with the continuous release version 2020.012.20043 the above message was shown.
After investigations, it turns out that the problem was with the way I was embedding the font. I am not aware if any other way exists, and I am not that familiar with the PDF specification to know otherwise. What I was doing, as you can see from the above code, was to load the font ONCE for the document, and then to use it freely in the resource dictionary of the appearance stream of EVERY annotation. This had as a result all the resource dictionaries of the annotation content streams to reference an F1 font that was defined with the SAME /BaseFont name. The PDF Reference, 3rd ed. on p.323 specifically states that:
"... the PostScript name of the font - ... - begins with a tag
followed by a plus sign (+). The tag consists of exactly six uppercase
letters; the choice of letters is arbitrary, but different subsets in
the same PDF file must have different tags..."
Once I started to call PDType0Font.load for each of my annotations and calling subset() (and of course setNeedToBeUpdated) after creating appearance stream for each of them, I saw that the BaseName attributes started to look indeed differently - and indeed, the older 2020 version of Acrobat Reader DC stopped complaining.
[edit 07/10/2021: even trying to use a single PDFont object per page (having multiple annotations with this font), and subsetting it once, after having called showText on appearances of all annotations, appears to not work - it appears that the subsetting uses the letters I passed to the first showText, and not the others, resulting in wrong rendering of the 2nd, 3rd etc. annotations that might have characters that didn't exist in the 1st annotation - so I reiterate that what worked was to use loadFont for each separate annotation and then (after modifying appearance with showText, which will mark the letters to be used during subsetting) to call subset() on each of these fonts (which will result in the change of the font name)]
Note that other than using iText RUPS for inspecting the PDF contents, one could use Foxit PDF viewer to at least ensure that the subset font names are different. Acrobat Reader DC and PDF-xChange in Properties -> Fonts just show the initial font name, like LiberationSans, without showing the 6-letter unique prefix.
UPDATE 19/04/2021 I am still working on this issue - because I still get reports about the infamous "Cannot extract the embedded font" message. It is quite possible that the original cause of that message was not (or not only) the fact that the different subsets had same BaseFont names. One thing that I am observing is that on some computers, the stamp annotations that I am using cause Acrobat Reader DC to open automatically the so called "Comments pane" - there are options to turn this automatic thing off (Preferences -> Commenting -> Show comments pane when a PDF with comments is opened). When this pane opens, either manually or automatically, the error message appears (and I was on my wits ends to see why same version of Acrobat Reader DC behaves differently for different machines). I think that Acrobat Reader tries to extract the full version of the font and fails, since it is only a subset. But, I guess, this doesn't have to do with the semantic contents of the document - the document still passes "qpdf --check". I am currently trying to find if it is possible to restrict stamps to not allow comments - i.e. some way to disable the comments pane in Acrobat Reader DC, although I have little hope.
UPDATE 20/04/2021 opened a new question here

Extract textual data from a PDF

I'm using a Java program to extract textual data from a PDF.
When I use this type of PDF I have no problem :
But when I use this type the extraction is not performed :
Have you any idea to resolve this problem?
Try using iText7 and following code:
File inputFile = new File("path_to_your_pdf");
PdfDocument pdfDocument = new PdfDocument(new PdfReader(inputFile));
String text = PdfTextExtractor.getTextFromPage(pdfDocument.getPage(1));
pdfDocument.close();
And let us know what the output is. And whether the output corresponds to what you'd expect.
As #mkl points out, this may simply be the the difference between extracting form-fields or not. In any case, the links to your pdfs would be much appreciated. As well as some code.
But you can of course extract both using iText.
Reading material:
http://developers.itextpdf.com/content/itext-7-examples/itext-7-form-examples
http://developers.itextpdf.com/content/itext-7-examples/itext-7-content-extraction-and-redaction

Add text in MS Word doc using apache POI

I have been trying to edit different types of documents using Apache POI. The script should handle both extensions .doc and .docx. I could successfully edit the .docx file using XWPF api and the required text was added at the end of the docx file.
For editing .doc files(which include header, footer and a few paragraphs), following script is used, which use HWPFDocument.
FileInputStream fis = new FileInputStream(args[0]);
POIFSFileSystem fs = new POIFSFileSystem(fis);
HWPFDocument doc = new HWPFDocument(fs);
Range range = doc.getRange();
CharacterRun run = range.insertAfter("FROM SEHWAGGG A FOUUURRRRRR");
run.setBold(true);
run.setItalic(true);
The script works fine with normal documents which does not have header and footer. But seems that the issue appears with complex documents. It insert text, but in between the paragraphs (and at the beginning using insertBefore()). There are no text replacements required, just have to put the text at the end of the document. I searched similar scripts but most of them handle text replacement.
How can I add the text at the end, after all paragraphs?
I've tested It with the following document:
At first (with your original code) it completely destroyed the document:
By changing the following line, the insert works fine for me:
// Old
Range range = doc.getEndnoteRange();
// New
Range range = doc.getEndnoteRange();
I'm afraid you are out of luck with HWPF with the current state of the project.
I created a custom HWPF library for one of our clients, but the changes are not public. The changes were huge, so you can't spend - say - a week and assume that things will be fixed. You might get away with the current public HWPF when only some text needs to be replaced without changing the string length ("abc" -> "123" or "a " -> "1234").

Merging two .odt files from code

How do you merge two .odt files? Doing that by hand, opening each file and copying the content would work, but is unfeasable.
I have tried odttoolkit Simple API (simple-odf-0.8.1-incubating) to achieve that task, creating an empty TextDocument and merging everything into it:
private File masterFile = new File(...);
...
TextDocument t = TextDocument.newTextDocument();
t.save(masterFile);
...
for(File f : filesToMerge){
joinOdt(f);
}
...
void joinOdt(File joinee){
TextDocument master = (TextDocument) TextDocument.loadDocument(masterFile);
TextDocument slave = (TextDocument) TextDocument.loadDocument(joinee);
master.insertContentFromDocumentAfter(slave, master.getParagraphByReverseIndex(0, false), true);
master.save(masterFile);
}
And that works reasonably well, however it looses information about fonts - original files are a combination of Arial Narrow and Windings (for check boxes), output masterFile is all in TimesNewRoman. At first I suspected last parameter of insertContentFromDocumentAfter, but changing it to false breaks (almost) all formatting. Am I doing something wrong? Is there any other way?
I think this is "works as designed".
I tried this once with a global document, which imports documents and display them as is... as long as paragraph styles have different names !
Using same named templates are overwritten with the values the "master" document have.
So I ended up cloning standard styles with unique (per document) names.
HTH
Ma case was a rather simple one, files I wanted to merge were generated the same way and used the same basic formatting. Therefore, starting off of one of my files, instead of an empty document fixed my problem.
However this question will remain open until someone comes up with a more general solution to formatting retention (possibly based on ngulams answer and comments?).

HWPFDocument / XWPFDocument New Lines

I am trying to pull data from microsoft-word and translate it to sql statement and inserting it an Oracle database.
When the data in ms-word contains a new line that is created by [Shift-Enter] and not just enter,
The text contains an icon that looks like a box with a question mark.
Where the ET is just standard new line using the enter key and the ST is new lines using the
Shift-Enter combination. So when generating the SQL and inserting it to oracle, oracle counts that not as a text, but as hex.
My question is, how to remove lines that is created by [shift-enter] to just a standard '\n'?
Thanks
Update
This is how i get the text information
POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream(file));
HWPFDocument doc = new HWPFDocument(fs);
WordExtractor we = new WordExtractor(doc);
text = we.getText();
Update Answer:
This was a bug in poi-3.6. In poi-3.8 it shows as \r.
What you're almost certainly seeing are "fields" in the word document, which are special blocks of text such as links, macros etc
Option number one is to continue using WordExtractor, but call stripFields(String) on the resulting text before using it. That'll remove any of these fields from the text for you.
The other option is to use a different way of getting the text out. WordToTextConverter is part of Apache POI, and is more complex code that handles more of the format and should skip these for you (WordExtractor is pretty simple and low level). The other is to use Apache Tika, which provides a common way of extracting text from a number of file formats. That does have the proper code to deal with fields, and as an added bonus it'll be trivial for you to support .docx or .pdf when your requirements change!

Categories