How can I get rid of blank pages in aspose.words? - java

I would like to delete blank pages before I save the data to a pdf file. Is there a simple way to do that?

Word documents are not fixed page formats, they are flow, more like HTML. So, there is no easy way to determine where page starts or ends and no easy way to determine whether some particular page is blank.
However, there are few options to set n explicit page break in Word document. For example, explicit page break
https://apireference.aspose.com/words/java/com.aspose.words/controlchar#PAGE_BREAK
PageBreakBefore paragraph option.
https://apireference.aspose.com/words/java/com.aspose.words/ParagraphFormat#PageBreakBefore
section break
https://docs.aspose.com/words/java/working-with-sections/
If you delete such explicit page breaks from your document, this might help you to get rid blank pages.

Related

How to repeat the whole page with content control data binding (docx)

I use the docx4j for creating documents fed with XML data. The ContentControlBindingExtensions example shows how to use a simple for loop over the data to generate rows in invoice for each item from the XML file.
However, I can not find any way to repeat the whole page per each item (let's say my XML contains people and there should be one page per each person). When using the authoring add-in for Word (suggested here) I can't select the whole page to put the for loop on.
I thought I can insert a Page Break (Ctrl+Enter) at the end of the template and select it inside a for loop. However, this results in one empty line at the top of every page but the first.
You can put a hard page break (Word: Insert > Page Break) inside a rich text content control.
You can even put a Section Break inside a rich text content control, and this can be of type "Next Page".
So as long as your content is less than a page, you'll get a whole page per item.

JSOUP Deleting closing and/or opening divs only

Hello Im googling for hours now and can't find answer...(or smt close to it)
What i am trying to do is, lets say i have this code(very simplified):
<div id="one"><div id="two"><div id="three"></div></div></div>
And what i want to do is delete specific amount of this elements , lets say 2 of them. So the result would be:
<div id="one"><div id="two"><div id="three"></div>
Or i want to delete this opening elements (again specific amount of them, lets say 2 again) but without knowing their full name (so we can assume if real name is id="one_54486464" i know its one_ ... )
So after deleting I get this result:
<div id="three"></div></div></div>
Can anyone suggest way to achieve this results? It does not have to Include JSOUP, any better. more simple or more efficient way is welcomed :) (But i am using JSOUP to parse document to get to the point where i am left with )
I hope i explain myself clearly if you have any question please do ask... Thanks :)
EDIT: Those elements that i want to delete are on very end of the HTML document(so nothing, nothing is behind them not body tag html tag nothing...)
Please keep that HTML document would have many across whole code and i want to delete only specific amount at the end of the document...
For the opening divs THOSE are on very beginning of my HTML document and nothing is before them... So i need to remove specific amount from the beginning without knowing their specific ID only start of it. Also this div has closing too somewhere in the document and that closing i want to keep there.
For the first case, you can get the element's html (using the html() method) and use some String methods on it to delete a couple of its closing tags.
Example:
e.html().replaceAll("(((\\s|\n)+)?<\\/div>){2}$","");
This will remove the last 2 closing div tags, to change the number of tags to be remove, just change the number between the curly brackets {n}
(this is just an example and is probably unreliable, you should use some other String methods to decide which parts to discard)
For the second case, you can select the inner element(s) and add some additional closing tags to it/them.
Example:
String s = e.select("#two").first().html() + "</div></div>";
To select an element that has an ID that starts with some String you can use this e.select("div[id^=two]")
You can find more details on how to select elements here
After Titus suggested regular expressions I decided to write regex for deleting opening divs too.
So I convert Jsoup Document to String then did the parsing on a string and then convert back to Jsoup Document so I can use Jsoup functions.
ADD: What I was doing is that I was parsing and connecting two pages to one seamless. So there was no missing opening div or closing... so my HTML code stay with no errors therefore I was able to convert it back to Jsoup Document without complications.

How do I truncate HTML strings to remove broken invalid HTML fragments?

In my Java webapp, I create summary text of long HTML text. In the process of truncation, the HTML fragments in the string often break, producing HTML string with invalid & broken fragments. Like this example HTML string:
Visit this link <img src="htt
Is there any Java library to deal with this better so that such broken fragments as above are avoided ?
Or could I let this be included in the HTML pages & somehow deal with this using client side code ?
Since browsers will usually be able to deal with almost any garbage you feed into it (if it ain't XHTML...), if the only thing that actually happens with the input (assuming it's valid HTML of any kind) is being sliced, then the only thing you have to worry about is to actually get rid of invalid opening tags; you won't be able to distinguish broken 'endings' of tags, since they, in themselves, ain't special in any way. I'd just take a slice I've generated and parse it from the end; if I encounter a stray '<', I'd get rid of everything after it. Likewise, I'd keep track of the last opened tag - if the next close after it wasn't closing that exact tag, it's likely the closing tag got out, so I'd insert it.
This would still generate a lot of garbage, but would at least fix some rudimentary problems.
A better way would be to manage a stack of opened/closed tags and generate/remove the needed/broken/unnecessary ones as they emerge. A stack is a proper solution since HTML tags musn't 'cross' [by the spec, AFAIR it's this way from HTML 4], i.e. <span><div></span></div> isn't valid.
A much better way would be to splice the document after first parsing it as SGML/HTML/XML (depends on the exact HTML doctype) - then you could just remove the nodes, without damaging the structure.
Note that you can't actually know if a tag is correct without providing an exact algorithm you use to generate this 'garbled' content.
I used owasp-java-html-sanitizer to fix those broken fragments to generate safe HTML markup from Java.
PolicyFactory html_sanitize_policy = Sanitizers.LINKS.and(Sanitizers.IMAGES);
String safeHTML = html_sanitize_policy.sanitize(htmlString);
This seemed to be easiest of all solutions I came across.

count the number of pages in a docx-document

I got a word file and I want to count how many pages are in it.
The file has been created with Docx4Java.
Anyone did this before?
Thanks!
docx4j doesn't have a page layout model, so it can't tell you a page count.
You can get an approximate page count by using FOP's page layout model. docx4j's PDF output now supports a "2 pass" generation:
first pass calculates the page count(s)
second pass generates the pdf
See https://github.com/plutext/docx4j/blob/master/src/main/java/org/docx4j/convert/out/fo/AbstractPlaceholderLookup.java
and
https://github.com/plutext/docx4j/blob/master/src/main/java/org/docx4j/convert/out/fo/ApacheFORenderer.java
So doing the first pass would give you (approximately) what you want. This uses org.apache.fop.apps.FormattingResults which records the number of pages in a page sequence, or in the document as a whole.
An alternative approach might be to use LibreOffice/OpenOffice (or Microsoft Word, for that matter).

getting text that will be displayed to user from html

Bit of a random one, i am wanting to have a play with some NLP stuff and I would like to:
Get all the text that will be displayed to the user in a browser from HTML.
My ideal output would not have any tags in it and would only have fullstops (and any other punctuation used) and new line characters, though i can tolerate a fairly reasonable amount of failure in this (random other stuff ending up in output).
If there was a way of inserting a newline or full stop in situations where the content was likely not to continue on then that would be considered an added bonus. e.g:
items in an ul or option tag could be separated by full stops (or to be honest just ignored).
I am working Java, but would be interested in seeing any code that does this.
I can (and will if required) come up with something to do this, just wondered if there was anything out there like this already, as it would probably be better than what I come up with in an afternoon ;-).
An example of the code I might write if I do end up doing this would be to use a SAX parser to find content in p tags, strip it of any span or strong etc tags, and add a full stop if I hit a div or another p without having had a fullstop.
Any pointers or suggestions very welcome.
Hmmm ... almost any HTML parser could be used to create the effect you want -- just run through all of the tags and emit only the text elements, and emit a LF for the closing tag of every block element. As you say, a SAX implementation would be simple and straight-forward.
I would just strip everything out that has <> tags and if you want to have a full stop at the end of every sentence you check for closing tags and place a full stop.
If you have
<strong> test </strong>
(and other tags that change the look of the test) you could place in conditions to not place a full stop here.
HTML parsers seem to be a reasonable starting point for this.
there are a number of them for example: HTMLCleaner and Nekohtml seem to work fine.
They are good as they fix the tags to allow you to more consistently process them, even if you are just removing them.
But as it turns out you probably want to get rid of script tags meta data etc. And in that case you are better working with well formed XML which these guy get for you from "wild" html.
there are many SO questions relating to this (like this one) you should search for "HTML parsing" though ;-)

Categories