Apache POI: Retrieve page number from XWPFParagraph instance? - java

I am iterating over XWPFParagraph instances coming from an XWPFDocument instance (using the "getParagraphs()" method). Is there a way to retrieve, from the XWPFParagraph instances, the page number on which each paragraph is located?

To eventually turn Gagravarr's comment into a proper answer: No, this is not possible.
Doing so would require a full-blown Word rendering engine (i.e. MS Word itself) and even then you cannot be absolutely sure that page breaks will always occur at exactly those positions where they happened to be when the file was once created (think: missing fonts, missing pictures, different display options for vanished text and/or revision marking, different printer margins, etc.).
So claiming that some content in a Word file is on a certain line X on a certain page Y actually expresses a fundamental misunderstanding of the Word file format. There is simply no notion of line and page in there. It's all about runs and ranges.
In other words: Only upon opening such a file with MS Word will those contents be rendered onto individual lines / pages. And the behavior of this renderer is unpredictable to a certain extent.

Related

Split docx to multiple docx using Java

I have a requirement to split one docx into multiple docx files based on subheadings.
The input document has a TOC, graphs, paragraphs, tables, images and drawing tools.
I have to write an app that takes a docx and generates multiple docx files based on the subheadings.
I could find a few resources for reading and writing paragraphs, but none for the other elements. Any suggestions for cloning the document and writing it out as-is, so that the style and format stay the same?
Thanks in advance
There are at least 2 ways to do this. The first is to use a clone of the entire document, but only including the relevant portion of the main document part. This is fairly easy to do, but the output documents might be large (since they contain unused images etc), unless you open/save in Word.
The second would be to use our commercial Docx4j Enterprise. You still have to identify where each chunk starts and finishes, but it will take just the objects referenced in that chunk (so you get small output documents).
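Whichever way you go, you first need the chunk boundaries. A minimal sketch of that step using only the JDK, under some assumptions: the subheadings are top-level paragraphs in word/document.xml whose paragraph style id is known (for example "Heading2", which is only the usual default in English templates), and the class and method names are invented for illustration. It returns the positions of the heading paragraphs among the body's child elements; a chunk then runs from one position up to the next.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class HeadingChunks {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Indices (among the child elements of w:body) of paragraphs using the given style id. */
    static List<Integer> chunkStarts(String documentXml, String headingStyleId) {
        Document doc = parse(documentXml);
        List<Integer> starts = new ArrayList<>();
        NodeList bodies = doc.getElementsByTagNameNS(W_NS, "body");
        if (bodies.getLength() == 0) return starts;
        int index = 0;
        for (Node n = bodies.item(0).getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE) continue;
            Element el = (Element) n;
            boolean isParagraph = W_NS.equals(el.getNamespaceURI()) && "p".equals(el.getLocalName());
            if (isParagraph && headingStyleId.equals(styleOf(el))) starts.add(index);
            index++; // tables, the trailing w:sectPr etc. are counted as positions too
        }
        return starts;
    }

    /** Style id from the paragraph's own w:pPr/w:pStyle, or "" if it has none. */
    static String styleOf(Element paragraph) {
        Element pPr = child(paragraph, "pPr");
        Element pStyle = pPr == null ? null : child(pPr, "pStyle");
        return pStyle == null ? "" : pStyle.getAttributeNS(W_NS, "val");
    }

    /** First direct child element with the given local name in the w: namespace. */
    static Element child(Element parent, String localName) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE
                    && W_NS.equals(n.getNamespaceURI())
                    && localName.equals(n.getLocalName())) {
                return (Element) n;
            }
        }
        return null;
    }

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }
}
```

This only locates the boundaries. Copying the chunk together with its images, styles and numbering is the part the two approaches above deal with.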

How to read a docx file and know when the page changes? APACHE POI

I need to know how MS Word paginates documents. Is it possible to detect where MS Word starts a new page, using Java and Apache POI?
No, because MS Word calculates page breaks on the fly, except for the ones you insert manually. Sometimes MS Word will leave artifacts in the XML to let you know where the page breaks were the last time they were calculated, but those artifacts don't have to be there, so you can't rely on them to get, say, a bit of text from the third paragraph on the second page. You have to render the document (POI won't help you there), and at that point you can calculate the page breaks. Also, since you have nearly zero chance of rendering the document in exactly the same way MS Word does, you might end up with a word or line on a different page than in MS Word.
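For reference, in a .docx the cached artifact is the w:lastRenderedPageBreak element and a manually inserted break is a w:br with w:type="page", both inside word/document.xml. A minimal sketch that counts them with only the JDK (no POI); the class and method names are made up for illustration, and it ignores other sources of page breaks such as w:pageBreakBefore and section breaks:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class PageBreakHints {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Breaks Word cached the last time it laid the document out (may be absent or stale). */
    static int countRenderedBreaks(Document doc) {
        return doc.getElementsByTagNameNS(W_NS, "lastRenderedPageBreak").getLength();
    }

    /** Explicit page breaks inserted by the author: w:br elements with w:type="page". */
    static int countManualBreaks(Document doc) {
        NodeList brs = doc.getElementsByTagNameNS(W_NS, "br");
        int count = 0;
        for (int i = 0; i < brs.getLength(); i++) {
            Element br = (Element) brs.item(i);
            if ("page".equals(br.getAttributeNS(W_NS, "type"))) count++;
        }
        return count;
    }

    /** Reads word/document.xml out of the .docx zip container. */
    static Document load(Path docx) {
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IllegalArgumentException("no word/document.xml in " + docx);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return parse(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static Document parse(String xml) {
        return parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    static Document parse(InputStream in) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true); // needed to match the w: namespace
            return f.newDocumentBuilder().parse(in);
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }
}
```

A count of zero rendered breaks does not mean the document has one page; it may simply never have been laid out by Word.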

PDF font subsetting and subset merging in Java

I have a part in my code where I am programmatically filling out PDF forms using iText (Java) based on user-entered data, and then I concatenate a number of such PDFs into one, again using iText.
The PDF forms that are getting merged can be (and usually are) different.
The resulting PDF is way too large - looking at it, 98% of the space is taken by fonts.
The way I understand it, what happens is that the individual PDF forms have different font subsets, so when I merge them, I get massive amount of duplicate glyphs, except that the subsets are not identical, so I can't get rid of them without merging the subsets.
The other problem is that the PDF forms themselves might not even contain subsets, but heavily packed fonts that have 2000+ glyphs, so even if I manage to leave only one instance of that font in the PDF, that still can be many megabytes. Hence it seems that I need to be able to 1) create and 2) merge existing font subsets.
The quirk is that I control neither the PDF forms (that are being filled out), nor their number, nor the order in which they are concatenated, so it is not possible to solve this by controlling what kind of fonts are embedded in them.
Adobe Acrobat can of course solve such a problem - it can create and also merge font subsets - but I need a programmatic, server-side solution. According to Google hits, iText cannot do this. Is there another library that I could use (or anything else I can do)?

count the number of pages in a docx-document

I have a Word file and I want to count how many pages are in it.
The file has been created with Docx4Java.
Has anyone done this before?
Thanks!
docx4j doesn't have a page layout model, so it can't tell you a page count.
You can get an approximate page count by using FOP's page layout model. docx4j's PDF output now supports a "2 pass" generation:
1) the first pass calculates the page count(s)
2) the second pass generates the PDF
See https://github.com/plutext/docx4j/blob/master/src/main/java/org/docx4j/convert/out/fo/AbstractPlaceholderLookup.java
and
https://github.com/plutext/docx4j/blob/master/src/main/java/org/docx4j/convert/out/fo/ApacheFORenderer.java
So doing the first pass would give you (approximately) what you want. This uses org.apache.fop.apps.FormattingResults which records the number of pages in a page sequence, or in the document as a whole.
An alternative approach might be to use LibreOffice/OpenOffice (or Microsoft Word, for that matter).
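Separately from the layout-based routes above, a .docx that was last saved by Word (or LibreOffice) usually carries a cached page count in docProps/app.xml, in the Pages element of the extended properties. This is a different technique from the FOP pass: it reads what the last application wrote rather than computing anything, and a file that has only ever been written by docx4j may not have the value at all. A minimal JDK-only sketch, with invented class and method names:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.OptionalInt;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class CachedPageCount {
    static final String EP_NS =
        "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties";

    /** Page count as stored in docProps/app.xml, or empty if the part or value is missing. */
    static OptionalInt fromDocx(Path docx) {
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            ZipEntry entry = zip.getEntry("docProps/app.xml");
            if (entry == null) return OptionalInt.empty();
            try (InputStream in = zip.getInputStream(entry)) {
                return fromAppXml(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static OptionalInt fromAppXml(String appXml) {
        return fromAppXml(new ByteArrayInputStream(appXml.getBytes(StandardCharsets.UTF_8)));
    }

    static OptionalInt fromAppXml(InputStream in) {
        Document doc;
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            doc = f.newDocumentBuilder().parse(in);
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        NodeList pages = doc.getElementsByTagNameNS(EP_NS, "Pages");
        if (pages.getLength() == 0) return OptionalInt.empty();
        try {
            return OptionalInt.of(Integer.parseInt(pages.item(0).getTextContent().trim()));
        } catch (NumberFormatException e) {
            return OptionalInt.empty(); // present but not a number: treat as unknown
        }
    }
}
```

Treat the result as a hint only: it reflects the layout on the machine that last saved the file and is not updated when the document is modified programmatically.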

Long text input from user and PDF generation

I have built a web application that can be seen as an overcomplicated application form. There are a bunch of text areas with a given character limit. After the form submission various things happen and one of them is PDF generation.
The text is queried from the DB and inserted in the PDF template created in iReports. This works fine but the major pain is overflowing text.
The maximum number of characters is set based on 'average' text. But sometimes people prefer to write in CAPS or add plenty of linefeeds to format their text. These then cause the user's text to overflow the space given in the PDF. Unfortunately the PDF document must look like a real application form so I cannot allow unlimited space.
What kind of approaches have you used to tackle this?
Clean/restrict user input?
Calculate the space requirement of the text based on font metrics?
Provide preview of the PDF? (too bad users are not allowed to change their input after submission...)
Ideally, calculate the requirement based on metrics. I don't know how iReports handles text, but with iText, it lays everything out itself, you just present the data as a streaming document, so we don't worry about overflowing text.
However, iReport may not support that, or you may need to have the PDF layout fit within certain bounds. I'd try to clean the input (ie: if it's all caps, lowercase/sentence case/proper case it), strip extra whitespace. If cleaning the input can't be reliably done, or people are still getting past that, I'd also restrict it.
As a last resort, I'd present the PDF for the user to authorize. Really, users shouldn't be given more work to do, and they're not going to do it anyways.
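The cleaning step could look roughly like this. It is only a sketch with made-up names, and the rules (collapse runs of spaces and tabs, allow at most one blank line, convert text that is entirely upper case to sentence case) are my assumptions about what "clean" should mean here:

```java
import java.util.Locale;

class InputCleaner {
    /** Collapses extra whitespace and tones down text typed entirely in capitals. */
    static String clean(String raw) {
        String s = raw.replace("\r\n", "\n").replace('\r', '\n'); // normalize line endings
        s = s.replaceAll("[ \\t]+", " ");    // runs of spaces/tabs become one space
        s = s.replaceAll(" ?\\n ?", "\n");   // no spaces around line breaks
        s = s.replaceAll("\\n{3,}", "\n\n"); // at most one blank line in a row
        s = s.trim();
        return isAllCaps(s) ? toSentenceCase(s) : s;
    }

    /** True if the text contains letters and none of them is lower case. */
    static boolean isAllCaps(String s) {
        boolean sawLetter = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isLetter(c)) {
                if (Character.isLowerCase(c)) return false;
                sawLetter = true;
            }
        }
        return sawLetter;
    }

    /** Lower-cases everything, then capitalizes the first letter after '.', '!', '?' or a line break. */
    static String toSentenceCase(String s) {
        StringBuilder out = new StringBuilder(s.length());
        boolean startOfSentence = true;
        for (char c : s.toLowerCase(Locale.ROOT).toCharArray()) {
            if (startOfSentence && Character.isLetter(c)) {
                out.append(Character.toUpperCase(c));
                startOfSentence = false;
            } else {
                out.append(c);
                if (c == '.' || c == '!' || c == '?' || c == '\n') startOfSentence = true;
            }
        }
        return out.toString();
    }
}
```

The sentence-casing is crude: it will also lower-case acronyms, names and the pronoun "I" in the middle of a sentence, which is one reason cleaning alone may not be reliable enough.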
Your own suggested solutions to your problem are all good. Probably the most important question to have answered is what should your PDF look like when the data to be displayed in a field won't fit? Do you ever need the "full answer" for anything else? When you know the answer to these, you'll have your options reduced.
For example if a field must be limited to 1/2 a page, and users sometimes enter more than 1/2 a page of text you can either
1) limit the user input - on submission calculate the size (using font-metrics as you said) and reject the submission until corrected. This assumes you can legitimately force the user to reduce their data entry.
2) accept the user input and truncate in the display of this report. Some systems use "..." to indicate data has been truncated, and can provide a hyperlink (even within the PDF) to get more information.
Providing a preview would work really well, but only if the users are good at checking and correcting and your system can handle the extra load this will generate.
Do you have control of the font that is used when generating the PDF? If so, I would look for a font in the monospace family. This will give you a consistent width for a given number of characters, regardless of punctuation, capitalization, etc.
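With a monospaced font the space check reduces to counting characters, so it can run at submission time without any font metrics. A rough sketch, assuming the field holds a known number of characters per line and a known number of lines (both numbers, and the class and method names, are placeholders), and assuming the report engine wraps at spaces and hard-breaks words longer than a line:

```java
class FieldFit {
    /** Lines the text occupies when word-wrapped at charsPerLine columns. */
    static int linesNeeded(String text, int charsPerLine) {
        if (charsPerLine <= 0) {
            throw new IllegalArgumentException("charsPerLine must be positive");
        }
        int lines = 0;
        // Every linefeed typed by the user starts a new line; -1 keeps trailing empty lines.
        for (String paragraph : text.split("\n", -1)) {
            lines += wrappedLines(paragraph, charsPerLine);
        }
        return lines;
    }

    private static int wrappedLines(String paragraph, int width) {
        int lines = 1; // even an empty paragraph takes up one line
        int col = 0;   // characters already placed on the current line
        for (String word : paragraph.trim().split(" +")) {
            if (word.isEmpty()) continue;
            int len = word.length();
            if (col > 0 && col + 1 + len <= width) {
                col += 1 + len; // fits after a separating space
                continue;
            }
            if (col > 0) lines++;        // move the word to a fresh line
            lines += (len - 1) / width;  // extra lines for a word longer than one line
            col = (len % width == 0) ? width : len % width;
        }
        return lines;
    }

    /** True if the text fits a field of maxLines lines with charsPerLine characters each. */
    static boolean fits(String text, int charsPerLine, int maxLines) {
        return linesNeeded(text, charsPerLine) <= maxLines;
    }
}
```

For example, FieldFit.fits(userText, 80, 12) could be checked on submission and the form rejected with a message if it returns false. The wrapping rules of the actual report engine may differ slightly, so leaving a line of slack is probably wise.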
