I would like to delete blank pages before I save the data to a pdf file. Is there a simple way to do that?
Word documents are not a fixed-page format; they are flow documents, more like HTML. So there is no easy way to determine where a page starts or ends, and no easy way to determine whether a particular page is blank.
However, there are a few ways to set an explicit page break in a Word document. For example, an explicit page-break character:
https://apireference.aspose.com/words/java/com.aspose.words/controlchar#PAGE_BREAK
the PageBreakBefore paragraph option:
https://apireference.aspose.com/words/java/com.aspose.words/ParagraphFormat#PageBreakBefore
or a section break:
https://docs.aspose.com/words/java/working-with-sections/
If you delete such explicit page breaks from your document, this might help you get rid of blank pages.
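As a rough sketch of how that could look with Aspose.Words for Java (the file names are placeholders, and a section break would need separate handling, e.g. by merging section content):

import com.aspose.words.*;

Document doc = new Document("input.docx");

// Strip explicit page-break characters from every run
for (Run run : (Iterable<Run>) doc.getChildNodes(NodeType.RUN, true)) {
    if (run.getText().contains(ControlChar.PAGE_BREAK)) {
        run.setText(run.getText().replace(ControlChar.PAGE_BREAK, ""));
    }
}

// Clear the PageBreakBefore flag on every paragraph
for (Paragraph para : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {
    para.getParagraphFormat().setPageBreakBefore(false);
}

doc.save("output.pdf");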
I am using iText 5.5, and right now I have a custom implementation of PdfPageEventHelper that adds a footer containing page-number information to each page.
Recent changes in my application have led to the existence of necessary footnotes. The way I am creating the PDF (it is dynamically created from a list of Components) makes it effectively impossible to determine which page contains which items, as that is part of the customizable styling options.
However, I need to add explanations to the footnote markers.
The approach I have now is to simply notify the PdfPageEventHelper that, somewhere in the document, there is at least one element that needs the (currently only) footnote, and then I add the explanatory footnote to every page.
This is something I want to avoid, as the future might bring more footnotes and explanations.
So the question is:
Can I parse the current page content directly and scan for the existence of marker text? Or is there another way to see if the current page needs the explanatory footnote?
My failed approaches so far (all in onEndPage(PdfWriter, Document)):
PdfContentByte cb = writer.getDirectContent();
PdfReader reader = new PdfReader(cb.toPdf(writer));
// this led to InvalidPdfException
----
ColumnText ct = new ColumnText(cb);
ct.getCompositeElements();
// returned null, I expected the current page contents
----
OutputStreamCounter oc = writer.getOs();
// did not expose any useful methods. also, cannot read from OutputStream
Googling the problem yielded dozens of results on how to add a page number or how to add a document-static, user-specific header, but nothing page-dependent.
Oh, and this, which is not really helpful:
Adding a pdf footer conditionally on certain pages in a multi-page pdf document
which seems basically to be the exact same problem as mine.
Essentially you'll have to do it the other way around:
Simply add a generic tag to the chunk of a footnote marker. Then your page event listener is informed about this generic tag between the start of the page and the end of it. If you set a flag in onGenericTag, therefore, your onEndPage method merely has to check (and later reset) that flag and add the footnote accordingly.
You can even use the generic tag text to differentiate between different markers and only add the matching footnotes.
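A minimal sketch of that approach (the marker names and footnote texts here are only illustrative):

import com.itextpdf.text.*;
import com.itextpdf.text.pdf.*;
import java.util.*;

public class FootnoteEvent extends PdfPageEventHelper {
    // Explanations keyed by marker tag (illustrative content)
    private final Map<String, String> explanations = new HashMap<String, String>();
    // Markers encountered since the current page started
    private final Set<String> seenOnPage = new LinkedHashSet<String>();

    public FootnoteEvent() {
        explanations.put("footnote-1", "Explanation for the first marker.");
    }

    @Override
    public void onGenericTag(PdfWriter writer, Document document, Rectangle rect, String text) {
        seenOnPage.add(text); // fired for each tagged Chunk rendered on this page
    }

    @Override
    public void onEndPage(PdfWriter writer, Document document) {
        float y = document.bottom() - 15;
        for (String marker : seenOnPage) {
            ColumnText.showTextAligned(writer.getDirectContent(), Element.ALIGN_LEFT,
                    new Phrase(explanations.get(marker)), document.left(), y, 0);
            y -= 12;
        }
        seenOnPage.clear(); // reset the flag(s) for the next page
    }
}

The marker chunks are then tagged while you build the document:

Chunk marker = new Chunk("*");
marker.setGenericTag("footnote-1");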
For an example use of generic tags, have a look at examples using the Chunk.setGenericTag(String) method, e.g. the sandbox example GenericFields.
(I originally referenced the iText site URL https://developers.itextpdf.com/examples/page-events-itext5/page-events-chunks but due to a restructuring of that site it leads nowhere specific anymore; but you can still find a copy of the original page using the wayback machine.)
Hello, I've been googling for hours now and can't find an answer (or something close to it)...
What I am trying to do is, let's say I have this code (very simplified):
<div id="one"><div id="two"><div id="three"></div></div></div>
And what I want to do is delete a specific number of these closing elements, let's say 2 of them. So the result would be:
<div id="one"><div id="two"><div id="three"></div>
Or I want to delete the opening elements (again a specific number of them, let's say 2 again), but without knowing their full name (so we can assume that if the real name is id="one_54486464", I know it starts with one_ ...).
So after deleting I get this result:
<div id="three"></div></div></div>
Can anyone suggest a way to achieve these results? It does not have to involve Jsoup; any better, simpler or more efficient way is welcome :) (But I am using Jsoup to parse the document to get to the point where I am left with the markup above.)
I hope I explained myself clearly; if you have any questions, please do ask... Thanks :)
EDIT: The closing elements that I want to delete are at the very end of the HTML document (nothing, nothing comes after them; no body tag, no html tag, nothing...).
Please keep in mind that the HTML document has many such divs across the whole code, and I want to delete only a specific number at the end of the document...
As for the opening divs, THOSE are at the very beginning of my HTML document and nothing comes before them... So I need to remove a specific number from the beginning without knowing their specific IDs, only the start of them. Also, these divs have closing tags somewhere in the document, and I want to keep those closing tags there.
For the first case, you can get the element's html (using the html() method) and use some String methods on it to delete a couple of its closing tags.
Example:
e.html().replaceAll("(((\\s|\n)+)?<\\/div>){2}$","");
This will remove the last 2 closing div tags; to change the number of tags to be removed, just change the number between the curly brackets {n}.
(this is just an example and is probably unreliable, you should use some other String methods to decide which parts to discard)
For the second case, you can select the inner element(s) and add some additional closing tags to it/them.
Example:
String s = e.select("#two").first().html() + "</div></div>";
To select an element whose ID starts with some String, you can use e.select("div[id^=two]").
You can find more details on how to select elements here
After Titus suggested regular expressions I decided to write regex for deleting opening divs too.
So I converted the Jsoup Document to a String, did the parsing on the string, and then converted back to a Jsoup Document so I can use the Jsoup functions.
ADD: What I was doing was parsing and connecting two pages into one seamless page. So there was no missing opening or closing div... my HTML code stayed free of errors, and therefore I was able to convert it back to a Jsoup Document without complications.
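A minimal sketch of that round trip (the regex and variable names are illustrative; note that Jsoup.parse() re-balances any tags the string edits leave open):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Document -> String -> regex edits -> Document
Document doc = Jsoup.parse(html);                // html: the combined page source (assumed)
String flat = doc.body().html();                 // raw markup as a String
flat = flat.replaceAll("(\\s*</div>){2}$", ""); // e.g. drop the last two closing divs
Document cleaned = Jsoup.parse(flat);            // back to a Document for further Jsoup work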
I am trying to extract nested sections from an HTML page. I want to eventually create wiki pages for each section. Extracting the text alone will not be an issue, but extracting the nested sections will be.
The page I am trying to extract sections from is - http://goo.gl/xb7Ydd
I am planning to extract the sections into XML (or JSON?), which could look something like this:
<1.1> Section 1.1
<1.1.1> Subsection of 1.1 </1.1.1>
<1.1.2> Subsection of 1.1 </1.1.2>
</1.1>
Can anyone suggest approaches other than complex regexs?
Use requests and BeautifulSoup:
import re
import requests
from bs4 import BeautifulSoup

r = requests.get("http://docs.oasis-open.org/cmis/CMIS/v1.1/os/CMIS-v1.1-os.html")  # get page using requests
soup = BeautifulSoup(r.content)
s = soup.find_all(text=re.compile(r'\.pdf'))  # find all text nodes containing '.pdf'
print s
[u'http://docs.oasis-open.org/cmis/CMIS/v1.1/os/CMIS-v1.1-os.pdf', u'http://docs.oasis-open.org/cmis/CMIS/v1.1/csprd01/CMIS-v1.1-csprd01.pdf', u'http://docs.oasis-open.org/cmis/CMIS/v1.1/CMIS-v1.1.pdf']
EDIT1: You can skip to Step 3 to see how I handled parsing the HTML.
I'm starting with the assumption that you haven't used an HTML parser before. Forgive me if the following comes off as patronizing or verbose; I'm just attempting to be thorough.
Before we get started, I should let you know that we're dealing with malformed HTML here (e.g., not a single </p> in sight), so we can't take a straight-forward approach. We're also dealing with a large file, so the going might be a little slow and error-prone.
If you choose to use Python for your project, there's a popular module called BeautifulSoup that can help with the bad HTML. If we wanted to use it to capture the text from a section of your page (say, the intro), we would follow these steps:
Acquire the HTML
Find relevant tags, id's and classes
Write a script to extract info and load into XML
Step 1
First, you need to look at the page's HTML. In your browser, after pulling the page up, right-click the page and hit "View page source". In the window that pops up, you should see a wall of text (over 38000 lines). Save the text to a file called "CMIS-v1.1-os.html", and place that file in your C drive's root folder (for convenience; put it somewhere else if you're not using Windows). We're gonna look through this to find what we'll need for the parser.
Step 2
Let's try to get the text of the introduction. When I search for the first line of text ("The Content Management Interoperability Services..."), I'm taken to line 129. This is the abstract, so I move to the next instance: Line 617. This is the text I'm looking for, so now I need to find out how it's tagged in HTML.
Look back at the page in your browser (the rendered page, not the wall). This section is headed with the phrase "1 Introduction" (line 615). In the wall, it's tagged 'h2' (i.e., it's between <h2> and </h2>). Now, 'h2' tags are pretty common in the wall, but this tag has the 'chapterHead' class, which isn't as common. Beyond that, it looks as if the header and the text are inside a 'p' tag with the 'noindent' class. (Note: I'm typing as I go; some of this info might not be useful in whatever code we come up with, but it should help you with your ideas.)
We know how to find a chapter (search for the 'chapterHead' class). Now let's find a chapter section. On the rendered page, I see "1.4 Examples". Searching the wall for the first line ("A set of request and response...") takes me to Line 878. In Line 876, I see the section's title in an 'h3' tag with the 'sectionHead' class. This strategy will also work for finding the tags that mark subsections and sub-subsections.
Ideally, each of the chapter sections would be contained in tags denoting a chapter, the subsections would be in tags denoting a section, etc. This ideal structure could be represented like this:
1
    1.1
        1.1.1
        1.1.2
    1.2
    1.3
        1.3.1
2
    2.1
Unfortunately for us, that structure doesn't exist in this document. What we have is:
1
1.1
1.1.1
1.1.2
1.2
1.3
1.3.1
2
2.1
We'll have to go through these tags sequentially, reacting to the tags as we come upon them.
Based on what we've seen, our strategy will look like this:
Find an 'h2' tag with the 'chapterHead' class
Go through the <p>s that we find, keeping an eye out for 'h3', 'h4', and 'h5' tags
Store text until we come across another 'h2' tag.
Step 3
If you don't have Python installed, I'd go to python.org and download Python 3.x. Once that's done, open up a command console and execute the following:
pip install beautifulsoup4
Now, run python in the console. Whatever we do next can either be typed line-by-line into the console or stored in a .py file and run later.
To get set up:
import bs4
from xml.etree.ElementTree import Element, SubElement, tostring

# Change the path if you're not using Windows
with open(r'C:\CMIS-v1.1-os.html') as f:
    soup = bs4.BeautifulSoup(f.read())
The first two lines import the modules we'll need. The last two lines just read the contents of your HTML file into a BeautifulSoup class. soup has a lot of methods that make searching for tags and classes easy. These lines might take a while to execute because your file is so large.
first_chapter = soup.find('h2', class_='chapterHead')
first_chapter holds the first 'h2' tag and its contents. Since this tag is on the same level as the other chapters, sections, and subsections, we'll use it as a starting point.
siblings = first_chapter.next_siblings
element = next(siblings)
siblings is a generator that'll deliver the contents of the document to us piecemeal. If you type print(element) in the console, you should see the text that follows the 'h2' tag (i.e., the first paragraph of the introduction). Ideally, this text would be inside a 'p' tag, but we're not dealing with well-formed HTML.
Every time that you run element = next(siblings), the generator delivers the element that follows on the same level of the document tree (it won't deliver any embedded tags). If we want to know the name of the tag, we can get it with element.name; if the element is text or a comment tag, it'll have no name.
Now, in order to build XML, we'll use xml.etree.ElementTree.
intro = Element('1')
intro.text = element
sect1 = SubElement(intro, '1.1')
print(tostring(intro))
intro.remove(sect1)
If you type tostring(intro) into your console, you should see some nice-looking XML. We'll remove sect1 for now.
Lastly, we need a way to track the level of text that we're on as we iterate through siblings. From what I saw of your page, subsections go up to four levels below the chapter level. Let's set up some control structure that checks the names of tags in order to determine what level we're at (this is meant for inspirational purposes):
depth = {'h2': 1, 'h3': 2, 'h4': 3, 'h5': 4, 'h6': 5}
depth_tags = ['h3', 'h4', 'h5', 'h6']
level = [1]
sects = [intro]
xml_tag = ''
last_sect = intro
sect = intro  # node that receives text until we hit the first subheading

for element in siblings:
    old_depth = len(level)
    tag = element.name
    if tag == 'h2':
        break
    if tag in depth_tags:
        if old_depth < depth[tag]:
            level.append(1)
        else:
            level = level[0:depth[tag]]
            level[-1] += 1
        sects = sects[0:depth[tag] - 1]
        last_sect = sects[-1]
        xml_tag = '.'.join(str(n) for n in level)
        sects.append(SubElement(last_sect, xml_tag))
        sect = sects[-1]
    if tag == 'p':
        sect.text = element.text
I'll explain the logic:
Initialize useful variables:
depth for converting a header tag to a depth level (assuming that the headers correspond exactly to a subsection type; if this assumption is wrong, you might try checking the element's class using element.get('class')).
depth_tags for storing the tags that signify a change in depth (excluding 'h2' because we're only doing the introduction).
level for keeping track of the current depth level.
sects for keeping track of the current XML node to which we may add text.
xml_tag for storing the label of the current XML node.
last_sect for storing the parent of the current XML node.
sect for the XML node that currently receives paragraph text (initialized to intro).
Iterate through all of the elements, starting with the element after the first paragraph (which we captured earlier in code).
If the next element is named 'h2', we know that we've reached the end of the current chapter, so we'll call it quits.
If the next element is one of the depth_tags, we'll change the XML node appropriately:
If the element's level is lower than the previous element's level (we're going deeper into the tree), we'll add a '1' to our level list (so if the list was [1, 1] before, it'll now be [1, 1, 1]).
If the element's level is the same or higher than the previous element's level (we're going sideways or higher into the tree), we'll shorten our level list to the length corresponding to our level; we'll also increment the last number of the updated list. (Example: the last tag was 'h5', and the current tag is 'h3'. If the level list was [1, 1, 3, 2] before, it'll now be [1, 2].) We'll also shorten the sects list to make room for a new XML node.
We'll add a new XML node to the sects list. '.'.join(str(n) for n in level) turns [1, 2, 3] into "1.2.3".
If the next element is a 'p' tag, we'll write its contents into the current XML node.
I haven't tested this, but even if it works perfectly, I'm certain you'll need to be careful with handling text outside of 'p' tags (like the first paragraph we found).
Closing Thoughts
I've gone over the basics of what you can use to get your (truly ugly-looking) HTML turned into XML. Some other points I didn't cover:
There are many comment tags in the HTML. These are represented in BeautifulSoup as elements with no name but text content.
You'll probably notice a lot of 'span' and 'a' tags in an element. Using element.text will make them go away, which is useful for just getting the text. If you want to leave them in, try ''.join(str(t) for t in element.contents).
Just so you know, the two-character sequence 'fi' shows up as the single ligature character 'ﬁ' in your page. Just sayin'.
I've got a Word file and I want to count how many pages are in it.
The file has been created with Docx4Java.
Has anyone done this before?
Thanks!
docx4j doesn't have a page layout model, so it can't tell you a page count.
You can get an approximate page count by using FOP's page layout model. docx4j's PDF output now supports a "2 pass" generation:
first pass calculates the page count(s)
second pass generates the pdf
See https://github.com/plutext/docx4j/blob/master/src/main/java/org/docx4j/convert/out/fo/AbstractPlaceholderLookup.java
and
https://github.com/plutext/docx4j/blob/master/src/main/java/org/docx4j/convert/out/fo/ApacheFORenderer.java
So doing the first pass would give you (approximately) what you want. This uses org.apache.fop.apps.FormattingResults, which records the number of pages in a page sequence, or in the document as a whole.
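A rough sketch of that first pass (assuming docx4j 3.x plus Apache FOP on the classpath; the file name is a placeholder):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.stream.StreamSource;
import org.apache.fop.apps.Fop;
import org.apache.fop.apps.FopFactory;
import org.apache.fop.apps.MimeConstants;
import org.docx4j.Docx4J;
import org.docx4j.convert.out.FOSettings;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;

public class PageCount {
    public static void main(String[] args) throws Exception {
        WordprocessingMLPackage pkg = WordprocessingMLPackage.load(new File("in.docx"));

        // First pass, step 1: have docx4j emit XSL-FO for the document
        FOSettings foSettings = Docx4J.createFOSettings();
        foSettings.setWmlPackage(pkg);
        ByteArrayOutputStream fo = new ByteArrayOutputStream();
        Docx4J.toFO(foSettings, fo, Docx4J.FLAG_EXPORT_PREFER_XSL);

        // First pass, step 2: render the FO with FOP and read FormattingResults
        FopFactory fopFactory = FopFactory.newInstance(new File(".").toURI());
        Fop fop = fopFactory.newFop(MimeConstants.MIME_PDF, new ByteArrayOutputStream());
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.transform(new StreamSource(new ByteArrayInputStream(fo.toByteArray())),
                new SAXResult(fop.getDefaultHandler()));
        System.out.println("Pages: " + fop.getResults().getPageCount());
    }
}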
An alternative approach might be to use LibreOffice/OpenOffice (or Microsoft Word, for that matter).