Count the number of pages in a docx document - Java

I have a Word file and I want to count how many pages are in it.
The file was created with docx4j.
Has anyone done this before?
Thanks!

docx4j doesn't have a page layout model, so it can't tell you a page count.
You can get an approximate page count by using FOP's page layout model. docx4j's PDF output now supports "2 pass" generation:
the first pass calculates the page count(s);
the second pass generates the PDF.
See https://github.com/plutext/docx4j/blob/master/src/main/java/org/docx4j/convert/out/fo/AbstractPlaceholderLookup.java
and
https://github.com/plutext/docx4j/blob/master/src/main/java/org/docx4j/convert/out/fo/ApacheFORenderer.java
So doing the first pass would give you (approximately) what you want. This uses org.apache.fop.apps.FormattingResults which records the number of pages in a page sequence, or in the document as a whole.
An alternative approach might be to use LibreOffice/OpenOffice (or Microsoft Word, for that matter).
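If the file has been opened and saved in Word (or LibreOffice) at some point, there is a cheaper thing to try first: those applications usually cache their last computed page count in the extended properties part of the package, conventionally docProps/app.xml, as a Pages element. The sketch below reads that cached value using only the JDK. It is not docx4j's or FOP's method, the class and method names are made up for illustration, and a file written purely by docx4j may have no such value (or a stale one), so treat the result as a hint.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.OptionalInt;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NodeList;

// Hypothetical helper: reads the page count that the last editing application
// cached in docProps/app.xml. It does not lay out the document itself.
class CachedPageCount {

    /** Extracts the Pages element from the XML of docProps/app.xml, if present. */
    static OptionalInt parsePages(byte[] appXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            NodeList pages = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(appXml))
                    .getElementsByTagNameNS("*", "Pages"); // any namespace
            if (pages.getLength() == 0) {
                return OptionalInt.empty();
            }
            return OptionalInt.of(Integer.parseInt(pages.item(0).getTextContent().trim()));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse docProps/app.xml", e);
        }
    }

    /** Scans the .docx zip container for docProps/app.xml (assumed location). */
    static OptionalInt read(InputStream docx) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if ("docProps/app.xml".equals(e.getName())) {
                    return parsePages(zip.readAllBytes()); // bytes of this entry only
                }
            }
        }
        return OptionalInt.empty(); // no extended properties part found
    }
}
```

Calling CachedPageCount.read(Files.newInputStream(path)) returns an empty OptionalInt when no cached count is present, in which case the FOP first pass or a real word processor is still needed.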

Related

How can I get rid of blank pages in aspose.words?

I would like to delete blank pages before I save the data to a pdf file. Is there a simple way to do that?
Word documents are not fixed page formats, they are flow, more like HTML. So, there is no easy way to determine where page starts or ends and no easy way to determine whether some particular page is blank.
However, there are a few ways to set an explicit page break in a Word document:
an explicit page break character
https://apireference.aspose.com/words/java/com.aspose.words/controlchar#PAGE_BREAK
the PageBreakBefore paragraph option
https://apireference.aspose.com/words/java/com.aspose.words/ParagraphFormat#PageBreakBefore
a section break
https://docs.aspose.com/words/java/working-with-sections/
Deleting such explicit page breaks from your document might help you get rid of blank pages.
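To make the idea concrete without depending on the Aspose API, here is a rough sketch that works on the raw WordprocessingML (word/document.xml) with the JDK's DOM parser. It removes w:br elements of type "page" and w:pageBreakBefore paragraph properties. This is a swapped-in plain-XML technique, not Aspose.Words code; the class name is hypothetical, section breaks are deliberately left alone, and the namespace is the one used by transitional .docx files.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper operating on the XML of word/document.xml.
class ExplicitPageBreakStripper {
    // WordprocessingML main namespace (transitional documents).
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    static Document parse(String documentXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(documentXml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    /** Removes explicit page breaks in place and returns how many were removed. */
    static int strip(Document doc) {
        // Collect first: the NodeLists are live and shrink as nodes are removed.
        List<Element> doomed = new ArrayList<>();
        NodeList breaks = doc.getElementsByTagNameNS(W_NS, "br");
        for (int i = 0; i < breaks.getLength(); i++) {
            Element br = (Element) breaks.item(i);
            if ("page".equals(br.getAttributeNS(W_NS, "type"))) {
                doomed.add(br); // plain line breaks (no type) are kept
            }
        }
        NodeList before = doc.getElementsByTagNameNS(W_NS, "pageBreakBefore");
        for (int i = 0; i < before.getLength(); i++) {
            doomed.add((Element) before.item(i));
        }
        for (Element e : doomed) {
            e.getParentNode().removeChild(e);
        }
        return doomed.size();
    }
}
```

Removing every explicit break is a blunt instrument: some of them are intentional, so in practice you would probably want to filter (for example, only breaks in otherwise empty paragraphs) before deleting.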

How to read a docx file and know when the page changes? Apache POI

I need to know how MS Word paginates documents. Is it possible to tell, with Java and Apache POI, when MS Word starts a new page?
No, because MS Word calculates page breaks on the fly, except for the ones you insert manually. Sometimes MS Word will leave artifacts in the XML to let you know where the page breaks were the last time they were calculated, but those artifacts don't have to be there, so you can't rely on them as a way to get, say, a bit of text from the third paragraph on the second page. You have to render the document (POI won't help you there), and at that time you can calculate the page breaks. Also, since you have nearly zero chance of rendering the document in exactly the same way MS Word does, you might end up rendering the document with a word or line on a different page than MS Word does.
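For the curious, the artifacts in question are w:lastRenderedPageBreak elements, and manually inserted breaks show up as w:br elements with w:type="page". The sketch below counts both in the XML of word/document.xml using only the JDK's DOM parser rather than POI. The class and method names are illustrative, the namespace is the one used by transitional .docx files, and, as said above, the rendered-break count is only a hint left over from the last time Word laid the document out.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper operating on the XML of word/document.xml.
class PageBreakHints {
    // WordprocessingML main namespace (transitional documents).
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    static Document parse(String documentXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(documentXml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    /** Breaks Word recorded the last time it paginated; may be absent or stale. */
    static int countLastRendered(String documentXml) {
        return parse(documentXml)
                .getElementsByTagNameNS(W_NS, "lastRenderedPageBreak")
                .getLength();
    }

    /** Manually inserted page breaks: w:br elements whose w:type is "page". */
    static int countExplicit(String documentXml) {
        NodeList breaks = parse(documentXml).getElementsByTagNameNS(W_NS, "br");
        int count = 0;
        for (int i = 0; i < breaks.getLength(); i++) {
            Element br = (Element) breaks.item(i);
            if ("page".equals(br.getAttributeNS(W_NS, "type"))) {
                count++; // a w:br without w:type is only a line break
            }
        }
        return count;
    }
}
```

Neither number is a page count: a file that Word has never paginated has no rendered-break markers at all, and section breaks or the PageBreakBefore paragraph option also start new pages without appearing here.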

How to repeat the whole page with content control data binding (docx)

I use the docx4j for creating documents fed with XML data. The ContentControlBindingExtensions example shows how to use a simple for loop over the data to generate rows in invoice for each item from the XML file.
However, I cannot find any way to repeat the whole page for each item (let's say my XML contains people and there should be one page per person). When using the authoring add-in for Word (suggested here), I can't select the whole page to put the for loop on.
I thought I could insert a Page Break (Ctrl+Enter) at the end of the template and select it inside a for loop. However, this results in one empty line at the top of every page but the first.
You can put a hard page break (Word: Insert > Page Break) inside a rich text content control.
You can even put a Section Break inside a rich text content control, and this can be of type "Next Page".
So as long as your content is less than a page, you'll get a whole page per item.

Apache POI: Retrieve page number from XWPFParagraph instance?

I am iterating over XWPFParagraph instances coming from an XWPFDocument instance (using the "getParagraphs()" method). Is there a way to retrieve, from the XWPFParagraph instances, the page number on which each paragraph is located?
To eventually turn Gagravarr's comment into a proper answer: No, this is not possible.
Doing so would require a full-blown Word rendering engine (i.e. MS Word itself) and even then you cannot be absolutely sure that page breaks will always occur at exactly those positions where they happened to be when the file was once created (think: missing fonts, missing pictures, different display options for vanished text and/or revision marking, different printer margins, etc.).
So claiming that some content in a Word file is on a certain line X on a certain page Y actually expresses a fundamental misunderstanding of the Word file format. There is simply no notion of line and page in there. It's all about runs and ranges.
In other words: only upon opening such a file with MS Word will those contents be rendered onto individual lines and pages. And the behavior of this renderer is unpredictable to a certain extent.

Similar content on multiple pages -- Jasper report

My boss is killing me with the 'awesome' idea of generating a Jasper report. As long as I am still alive, I have to ask the experts here how to make it happen.
The original requirement was to generate a PDF report with text and forms (tables) on one page. I did that.
But the new requirement asks for a report with four pages, each containing the content of the original one-page report with only one change (different text inserted on every page).
I did some research and didn't find an easy way to do it. So I hope everyone on SO can give me a hint. Thanks a lot!
Could you hide and show columns based on expression? For example, map fields over fields in your template to account for all 4 pages, and for a given field only show data when PAGE_NUMBER = x or COLUMN_NUMBER = x.
In your query, you could create a constant to identify the page where each row of data should be printed and group on that constant.
Or -- 4 detail bands, and each is set to print only on the appropriate page.
If it's the same text, put the text in String objects;
then you can use the same report with different data objects that hold the strings for the text.
