Rewritten to look more like a programming question
Okay, so I have done a little more research and it looks like the Java package I need to use is docx4j. Unfortunately, my lack of familiarity with the package, as well as with the underpinnings of the docx format, makes it difficult for me to figure out exactly how to make use of the headers and footers returned by SectionWrapper.getHeaderFooterPolicy(). It's not entirely clear whether the HeaderPart and FooterPart objects returned are writable or how to modify them.
There is this code, which offers an example of how to create a header part, but it creates a new HeaderPart and adds it to the document.
I want to find existing header/footer parts and either remove them if possible or empty them out. Ideally they would be entirely gone from the document.
This code is similar and lets you set the text of a HeaderPart using setJaxbElement, but much of this terminology is unfamiliar, and I'm concerned the end result will be me creating headers (albeit empty ones) in each document rather than removing them.
Original Question Below
I am dealing with a set of wildly varying MS Word documents. I am compiling them into a single PDF and want to make sure that none of them have headers or footers before doing so.
Ideally, I'd also like to override their default font if it isn't Times New Roman.
Is there any way to do this programmatically or using some sort of batch process?
I will be running this on a Windows server that doesn't currently have Office or Word installed (although I think it might have an install of OpenOffice, and of course it's easy to just add an install as well).
Right now I'm using some version of iText (Java) to convert the files to PDF. I know that apparently iText can't do things like removing headers/footers, but since the underlying structure of modern .docx files is XML, I'm wondering if there is an API (or even an XML parsing/editing API or, if all else fails, a regex [horrors]) for removing the headers and footers and setting some default styles.
Here is some code hot off the press to do what you want:
public class HeaderFooterRemove {

    public static void main(String[] args) throws Exception {

        // A docx or a dir containing docx files
        String inputpath = System.getProperty("user.dir") + "/testHF.docx";

        StringBuilder sb = new StringBuilder();
        File dir = new File(inputpath);
        if (dir.isDirectory()) {
            String[] files = dir.list();
            for (int i = 0; i < files.length; i++) {
                if (files[i].endsWith("docx")) {
                    sb.append("\n\n" + files[i] + "\n");
                    removeHFFromFile(new java.io.File(inputpath + "/" + files[i]));
                }
            }
        } else if (inputpath.endsWith("docx")) {
            sb.append("\n\n" + inputpath + "\n");
            removeHFFromFile(new java.io.File(inputpath));
        }
        System.out.println(sb.toString());
    }

    public static void removeHFFromFile(File f) throws Exception {

        WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(f);
        MainDocumentPart mdp = wordMLPackage.getMainDocumentPart();

        // Remove from sectPr
        SectPrFinder finder = new SectPrFinder(mdp);
        new TraversalUtil(mdp.getContent(), finder);
        for (SectPr sectPr : finder.getSectPrList()) {
            sectPr.getEGHdrFtrReferences().clear();
        }

        // Remove rels
        List<Relationship> hfRels = new ArrayList<Relationship>();
        for (Relationship rel : mdp.getRelationshipsPart().getRelationships().getRelationship()) {
            if (rel.getType().equals(Namespaces.HEADER)
                    || rel.getType().equals(Namespaces.FOOTER)) {
                hfRels.add(rel);
            }
        }
        for (Relationship rel : hfRels) {
            mdp.getRelationshipsPart().removeRelationship(rel);
        }

        wordMLPackage.save(f);
    }
}
The above code relies on SectPrFinder, so copy that somewhere.
I've left the imports out for brevity, but you can copy those from GitHub.
When it comes to making the set of docx into a single PDF, obviously you can either merge them into a single docx, then convert that to PDF, or convert them all to PDF, then merge those PDFs. If you prefer the former approach (for example, because end-users want to be able to edit the package of documents), then you may wish to consider our commercial extension for docx4j, MergeDocx.
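If you go the convert-then-merge route, here is a rough sketch of the PDF merge step with iText 5.x (this assumes you are on the 5.x generation of iText; the file names are placeholders):

import java.io.FileOutputStream;
import com.itextpdf.text.Document;
import com.itextpdf.text.pdf.PdfCopy;
import com.itextpdf.text.pdf.PdfReader;

public class MergePdfs {
    public static void main(String[] args) throws Exception {
        String[] inputs = { "a.pdf", "b.pdf", "c.pdf" }; // placeholder file names
        Document document = new Document();
        PdfCopy copy = new PdfCopy(document, new FileOutputStream("merged.pdf"));
        document.open();
        for (String input : inputs) {
            PdfReader reader = new PdfReader(input);
            copy.addDocument(reader); // appends every page of this PDF
            reader.close();
        }
        document.close(); // also flushes and closes the PdfCopy output
    }
}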
To remove the headers/footers, there is a fairly easy solution:
Open the docx as a zip and remove the files named header*.xml / footer*.xml (located in the word folder).
Structure of a unzipped docx: https://stackoverflow.com/tags/docx/info
To really remove the link (if you don't, the document will probably end up corrupted):
You need to edit the document.xml.rels file and remove all the Relationships that include a footer/header. Here is an example of a relationship that you should remove:
<Relationship Id="rId13" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/footer" Target="footer2.xml"/>
and, more generally, all relationships whose type contains 'footer' or 'header'.
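A rough sketch of that manual zip edit in Java (the file name is a placeholder, and the crude regex on the .rels file stands in for the hand edit described above; a real XML parser, or the docx4j approach earlier, is safer):

import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.ArrayList;
import java.util.List;

public class ZipHeaderFooterStrip {
    public static void main(String[] args) throws Exception {
        Path docx = Paths.get("input.docx"); // placeholder path
        try (FileSystem zip = FileSystems.newFileSystem(docx, (ClassLoader) null)) {
            // 1. delete word/header*.xml and word/footer*.xml
            List<Path> parts = new ArrayList<>();
            try (DirectoryStream<Path> ds =
                    Files.newDirectoryStream(zip.getPath("/word"), "{header,footer}*.xml")) {
                ds.forEach(parts::add);
            }
            for (Path p : parts) {
                Files.delete(p);
            }
            // 2. drop the header/footer relationships from document.xml.rels
            Path rels = zip.getPath("/word/_rels/document.xml.rels");
            String xml = new String(Files.readAllBytes(rels), StandardCharsets.UTF_8);
            xml = xml.replaceAll("<Relationship [^>]*(header|footer)[^>]*/>", "");
            Files.write(rels, xml.getBytes(StandardCharsets.UTF_8));
        }
    }
}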
Related
I am developing a font converter app which converts Unicode text to Krutidev/Shree Lipi (Marathi/Hindi) font text. The original docx file contains formatted words (i.e. color, font, text size, hyperlinks, etc.).
I want to keep format of the final docx same as the original docx after converting words from Unicode to another font.
PFA.
Here is my Code
try {
    fileInputStream = new FileInputStream("StartDoc.docx");
    document = new XWPFDocument(fileInputStream);
    XWPFWordExtractor extractor = new XWPFWordExtractor(document);
    List<XWPFParagraph> paragraph = document.getParagraphs();
    Converter data = new Converter();
    for (XWPFParagraph p : document.getParagraphs()) {
        for (XWPFRun r : p.getRuns()) {
            String string2 = r.getText(0);
            data.uniToShree(string2);
            r.setText(string2, 0);
        }
    }
    // Write the document to the file system
    FileOutputStream out = new FileOutputStream(new File("Output.docx"));
    document.write(out);
    out.close();
    System.out.println("Output.docx written successfully");
}
catch (IOException e) {
    System.out.println("We had an error while reading the Word Doc");
}
Thank you for asking.
I worked with POI some years ago, though on Excel workbooks, but I'll still try to help you reach the root cause of your error.
The Java runtime gives you good debugging information in the exception itself!
A good first step to disambiguate the error is not to discard the exception message that is handed to you.
Try printing the results of e.getLocalizedMessage() or e.getMessage() and see what you get.
Getting the stack trace using the printStackTrace method is also often useful to pinpoint where your error lies!
Share your findings from the above method calls so we can help you debug the issue further.
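For example, a catch block along these lines (just a sketch, keeping your existing message) surfaces the real cause instead of hiding it:

catch (IOException e) {
    System.out.println("We had an error while reading the Word Doc: " + e.getMessage());
    e.printStackTrace(); // prints the exact location and the chain of causes
}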
[EDIT 1:]
So it seems you are able to process the file just right with respect to the font conversion of the data, but you are not able to reconstruct the formatting of the original data in the converted file.
(Thus, the "We had an error while reading the Word Doc" message that gets printed is misleading ;) )
Now, there are 2 elements to a Word document:
Content
Structure or Schema
You are able to convert the data because you are working only on the content of your doc files.
In order to retain the formatting of the contents, your solution needs to be aware of the formatting of the doc files as well and take care of that.
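For example, one way to take care of it with POI is to replace the text run by run, since each XWPFRun carries its own formatting. This is only a sketch: it assumes your uniToShree method returns the converted string, and the legacy font name is a placeholder you would swap for the one you actually target.

for (XWPFParagraph p : document.getParagraphs()) {
    for (XWPFRun r : p.getRuns()) {
        String text = r.getText(0);
        if (text != null) {
            // write the converted text back into the SAME run, so its
            // color, size, bold/italic etc. are left untouched
            r.setText(data.uniToShree(text), 0);
            // the legacy font has to be set explicitly, otherwise the run
            // keeps its original (Unicode) font name
            r.setFontFamily("Kruti Dev 010"); // assumed target font name
        }
    }
}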
MS Word, which defines the doc files and their extension (.docx), follows a particular set of schemas that define the rules of formatting. These schemas are defined in Microsoft's XML namespace packages [1].
You can obtain the XML (HTML) form of the doc file you want quite easily (see the steps in [1] or the code in [2]) and even apply different schemas, or possibly your own schema definitions based on the definitions provided by MS's namespaces. You can do this programmatically, for which you need to get versed in XML, XSL and XSLT concepts (w3schools [3] is a good starting point), though this method is no less complex than writing your own version of MS Word; or you can use MS Word's built-in tools as shown in [1].
[1]. https://www.microsoftpressstore.com/articles/article.aspx?p=2231769&seqNum=4#:~:text=During%20conversion%2C%20Word%20tags%20the,you%20can%20an%20HTML%20file.
[2]. https://svn.apache.org/repos/asf/poi/trunk/src/scratchpad/testcases/org/apache/poi/hwpf/converter/TestWordToHtmlConverter.java
[3]. https://www.w3schools.com/xml/
My answer gives you a cursory overview of how to achieve what you want, but depending on your inclination and time availability, you may want to use your discretion before deciding to head down one path rather than the other.
Hope it helps!
I have a piece of Legacy software called Mixmeister that saved off playlist files in an MMP format.
This format appears to contain binary as well as file paths.
I am looking to extract the file paths along with any additional information I can from these files.
I see this has been done using Java (I do not know Java) here (see around line 56):
https://github.com/liesen/CueMeister/blob/master/src/mixmeister/mmp/MixmeisterPlaylist.java
and Haskell here:
https://github.com/larjo/MixView/blob/master/ListFiles.hs
So far, I have tried reading the file as binary (got stuck); using Regex expressions (messy output with moderate success) and attempting to try some code to read chunks (beyond my skill level).
The code I am using with moderate success for Regex is:
import re

file = 'C:\\Users\\xxx\\Desktop\\mixmeisterfile.mmp'
with open(file, 'r', encoding="Latin-1") as filehandle:
    # with open(file, 'rb') as filehandle:
    for text in filehandle:
        b = re.search('TRKF(.*)TKLYTRKM', text)
        if b:
            print(b.group())
Again, this gets me close, but it is messy (the data is not all intact and is surrounded by ASCII and binary characters). Basically, my logic just searches between two strings to attempt to extract the filenames. What I am really trying to do is get closer to what the Java code on GitHub does (the code below is sampled from the GitHub link):
List<Track> tracks = new ArrayList<Track>();
Marker trks = null;

for (Chunk chunk : trkl.getChunks()) {
    TrackHeader header = new TrackHeader();
    String file = "";
    List<Marker> meta = new LinkedList<Marker>();

    if (chunk.canContainSubchunks()) {
        for (Chunk chunk2 : ((ChunkContainer) chunk).getChunks()) {
            if ("TRKH".equals(chunk2.getIdentifier())) {
                header = readTrackHeader(chunk2);
            } else if ("TRKF".equals(chunk2.getIdentifier())) {
                file = readTrackFile(chunk2);
            } else {
                if (chunk2.canContainSubchunks()) {
                    for (Chunk chunk3 : ((ChunkContainer) chunk2).getChunks()) {
                        if ("TRKM".equals(chunk3.getIdentifier())) {
                            meta.add(readTrackMarker(chunk3));
                        } else if ("TRKS".equals(chunk3.getIdentifier())) {
                            trks = readTrackMarker(chunk3);
                        }
                    }
                }
            }
        }
    }

    Track tr = new Track(header, file, meta);
I am guessing this would use either RIFF or the chunk library in Python, if not done with a regex? Although I read the documentation at https://docs.python.org/2/library/chunk.html, I am not sure that I understand how to go about something like this - mainly, I do not understand how to properly read the binary file that has the visible file paths mixed in.
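Not a proper chunk parser, but one low-tech fallback is to scan the raw bytes for printable runs, like the Unix strings tool, and keep the runs that look like file paths. The sketch below is in Java (the language of the CueMeister reference above); the file name and the extension list are assumptions, and it assumes the paths are stored as single-byte text - if they come out interleaved with NUL bytes they are probably UTF-16 and would need decoding first.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class MmpPathExtractor {

    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get("mixmeisterfile.mmp")); // placeholder path
        List<String> paths = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (byte b : data) {
            char c = (char) (b & 0xFF);
            if (c >= 0x20 && c < 0x7F) {   // printable ASCII: keep extending the current run
                run.append(c);
            } else {                       // non-printable byte: the current run ends here
                harvest(run, paths);
                run.setLength(0);
            }
        }
        harvest(run, paths);               // don't lose a run that ends at EOF
        paths.forEach(System.out::println);
    }

    // Keep runs that look like Windows paths to audio files
    // (drive letter plus ":\" plus one of a few assumed extensions).
    private static void harvest(StringBuilder run, List<String> out) {
        String s = run.toString();
        if (s.matches("(?i).*[a-z]:\\\\.*\\.(mp3|wav|wma|flac|m4a).*")) {
            out.add(s);
        }
    }
}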
I don't really know what's going on here, but I'll try my best, and if it doesn't work out then please excuse my stupidity. When I had a project parsing weather data for a METAR, I realized my main issue was that I was trying to turn everything into a String, which wasn't suitable for all the data, so it would just come out as nothing. Your for loop should work just fine. However, when you traverse, have you tried making everything the same type, such as a Character/String type? Perhaps certain elements are messed up simply because they don't match the type you are going for.
Is there a way to use StAX and JAX-B to create an index and then get quick access to an XML file?
I have a large XML file and I need to find information in it. This is used in a desktop application, so it should work on systems with little RAM.
So my idea is this: Create an index and then quickly access data from the large file.
I can't just split the file because it's an official federal database that I want to use unaltered.
Using an XMLStreamReader, I can quickly find an element and then use JAXB to unmarshal it.
// xf is a standard StAX factory; "bla" is the attribute value being searched for
final XMLInputFactory xf = XMLInputFactory.newInstance();
final XMLStreamReader r = xf.createXMLStreamReader(filename, new FileInputStream(filename));
final JAXBContext ucontext = JAXBContext.newInstance(Foo.class);
final Unmarshaller unmarshaller = ucontext.createUnmarshaller();

r.nextTag();
while (r.hasNext()) {
    final int eventType = r.next();
    if (eventType == XMLStreamConstants.START_ELEMENT
            && r.getLocalName().equals("foo")
            && Long.parseLong(r.getAttributeValue(null, "bla")) == bla) {
        // JAX-B works just fine:
        final JAXBElement<Foo> foo = unmarshaller.unmarshal(r, Foo.class);
        System.out.println(foo.getValue().getName());
        // But how do I get the offset?
        // cache.put(r.getAttributeValue(null, "id"), r.getCursor()); // ???
        break;
    }
}
But I can't get the offset. I'd like to use this to prepare an index:
(id of element) -> (offset in file)
Then I should be able to use the offset to just unmarshal from there: open a file stream, skip that many bytes, unmarshal.
I can't find a library that does this. And I can't do it on my own without knowing the position of the file cursor. The javadoc clearly states that there is a cursor, but I can't find a way of accessing it.
Edit:
I'm just trying to offer a solution that will work on old hardware so people can actually use it. Not everyone can afford a new and powerful computer. Using StAX I can get the data in about 2 seconds, which is a bit long, but it requires hardly any RAM; it takes 300 MB of RAM to just use JAX-B on the whole file. Using some embedded DB system would just be a lot of overhead for such a simple task. I'll use JAX-B anyway; anything else would be useless for me since the wsimport-generated classes are already perfect. I just don't want to load 300 MB of objects when I only need a few.
I can't find a DB that just needs an XSD to create an in-memory DB and doesn't use that much RAM. It's all made for servers, or it requires you to define a schema and map the XML. So I assume it just doesn't exist.
You could work with a generated XML parser using ANTLR4.
The following works very well on a ~17 GB Wikipedia dump (/20170501/dewiki-20170501-pages-articles-multistream.xml.bz2), but I had to increase the heap size using -Xmx6g.
1. Get XML Grammar
cd /tmp
git clone https://github.com/antlr/grammars-v4
2. Generate Parser
cd /tmp/grammars-v4/xml/
mvn clean install
3. Copy Generated Java files to your Project
cp -r target/generated-sources/antlr4 /path/to/your/project/gen
4. Hook in with a Listener to collect character offsets
package stack43366566;

import java.util.ArrayList;
import java.util.List;

import org.antlr.v4.runtime.ANTLRFileStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

import stack43366566.gen.XMLLexer;
import stack43366566.gen.XMLParser;
import stack43366566.gen.XMLParser.DocumentContext;
import stack43366566.gen.XMLParserBaseListener;

public class FindXmlOffset {

    List<Integer> offsets = null;
    String searchForElement = null;

    public class MyXMLListener extends XMLParserBaseListener {
        public void enterElement(XMLParser.ElementContext ctx) {
            String name = ctx.Name().get(0).getText();
            if (searchForElement.equals(name)) {
                offsets.add(ctx.start.getStartIndex());
            }
        }
    }

    public List<Integer> createOffsets(String file, String elementName) {
        searchForElement = elementName;
        offsets = new ArrayList<>();
        try {
            XMLLexer lexer = new XMLLexer(new ANTLRFileStream(file));
            CommonTokenStream tokens = new CommonTokenStream(lexer);
            XMLParser parser = new XMLParser(tokens);
            DocumentContext ctx = parser.document();
            ParseTreeWalker walker = new ParseTreeWalker();
            MyXMLListener listener = new MyXMLListener();
            walker.walk(listener, ctx);
            return offsets;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] arg) {
        System.out.println("Search for offsets.");
        List<Integer> offsets = new FindXmlOffset().createOffsets(
                "/tmp/dewiki-20170501-pages-articles-multistream.xml", "page");
        System.out.println("Offsets: " + offsets);
    }
}
5. Result
Prints:
Offsets: [2441, 10854, 30257, 51419 ....
6. Read from Offset Position
To test the code, I've written a class that reads each Wikipedia page into a Java object
@JacksonXmlRootElement
class Page {
    public Page() {}
    public String title;
}
using basically this code
private Page readPage(Integer offset, String filename) {
    try (Reader in = new FileReader(filename)) {
        in.skip(offset);
        ObjectMapper mapper = new XmlMapper();
        mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
        Page object = mapper.readValue(in, Page.class);
        return object;
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}
Find complete example on github.
I just had to solve this problem, and spent way too much time figuring it out. Hopefully the next poor soul who comes looking for ideas can benefit from my suffering.
The first problem to contend with is that most XMLStreamReader implementations provide inaccurate results when you ask them for their current offsets. Woodstox however seems to be rock-solid in this regard.
The second problem is the actual type of offset you use. You have to use char offsets if you need to work with a multi-byte charset, which means the random-access retrieval from the file using the provided offsets is not going to be very efficient - you can't just set a pointer into the file at your offset and start reading, you have to read through until you get to the offset (that's what skip does under the covers in a Reader), then start extracting. If you're dealing with very large files, that means retrieval of content near the end of the file is too slow.
I ended up writing a FilterReader that keeps a buffer of byte offset to char offset mappings as the file is read. When we need to get the byte offset, we first ask Woodstox for the char offset, then get the custom reader to tell us the actual byte offset for the char offset. We can get the byte offset from the beginning and end of the element, giving us what we need to go in and surgically extract the element from the file by opening it as a RandomAccessFile, which means it's super fast at any point in the file.
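For the retrieval half, here is a minimal sketch. It assumes the element's start and end byte offsets have already been computed by the byte-tracking reader described above (not shown here), and that the element is self-contained, i.e. it doesn't rely on namespace prefixes declared on ancestor elements.

import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.RandomAccessFile;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.transform.stream.StreamSource;

// Given the byte offsets of an element's first and last byte, pull just that
// slice out of the file and unmarshal it on its own.
static Foo readElement(File xmlFile, long startByte, long endByte) throws Exception {
    byte[] slice = new byte[(int) (endByte - startByte)];
    try (RandomAccessFile raf = new RandomAccessFile(xmlFile, "r")) {
        raf.seek(startByte);    // jump straight to the element, no read-through
        raf.readFully(slice);
    }
    Unmarshaller u = JAXBContext.newInstance(Foo.class).createUnmarshaller();
    return u.unmarshal(new StreamSource(new ByteArrayInputStream(slice)), Foo.class)
            .getValue();
}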
I created a library for this, it's on GitHub and Maven Central. If you just want to get the important bits, the party trick is in the ByteTrackingReader.
Some people have commented that this whole thing is a bad idea and asked why you would want to do it: XML is a transport mechanism, and you should just import it into a DB and work with the data using more appropriate tools. For most cases this is true, but if you're building applications or integrations that communicate via XML, you need tooling to analyze and operate on the files that are exchanged. I get daily requests to verify feed contents; having the ability to quickly extract a specific set of items from a massive file and verify not only the contents but the format itself is essential.
Anyhow, hopefully this can save someone a few hours, or at least get them closer to a solution.
How do you merge two .odt files? Doing it by hand, opening each file and copying the content, would work, but is unfeasible.
I have tried the odftoolkit Simple API (simple-odf-0.8.1-incubating) to achieve that task, creating an empty TextDocument and merging everything into it:
private File masterFile = new File(...);
...
TextDocument t = TextDocument.newTextDocument();
t.save(masterFile);
...
for (File f : filesToMerge) {
    joinOdt(f);
}
...
void joinOdt(File joinee) {
    TextDocument master = (TextDocument) TextDocument.loadDocument(masterFile);
    TextDocument slave = (TextDocument) TextDocument.loadDocument(joinee);
    master.insertContentFromDocumentAfter(slave, master.getParagraphByReverseIndex(0, false), true);
    master.save(masterFile);
}
And that works reasonably well; however, it loses information about fonts - the original files are a combination of Arial Narrow and Wingdings (for check boxes), while the output masterFile is all in Times New Roman. At first I suspected the last parameter of insertContentFromDocumentAfter, but changing it to false breaks (almost) all formatting. Am I doing something wrong? Is there any other way?
I think this is "works as designed".
I tried this once with a global document, which imports documents and displays them as is... as long as the paragraph styles have different names!
Styles with the same names are overwritten with the values from the "master" document.
So I ended up cloning standard styles with unique (per document) names.
HTH
My case was a rather simple one: the files I wanted to merge were generated the same way and used the same basic formatting. Therefore, starting off from one of my files instead of an empty document fixed my problem.
However, this question will remain open until someone comes up with a more general solution to formatting retention (possibly based on ngulam's answer and comments?).
I am new to Documentum DFC. I wrote code using the DFC API to check out a document and it worked properly. But now I want to check in the same document with a new file that is present on my local PC's drive. I have searched but didn't find any good, easy answers.
I would be grateful if someone could provide some guidance here.
New version (requires VERSION privileges):
boolean keepLock = false;
String versionLabels = "";
IDfSysObject doc = (IDfSysObject) session.getObject(new DfId("0900000000000000"));
doc.checkout();
doc.setFile("C:\\temp\\temp.jpg"); // assuming you're using windows
doc.checkin(keepLock, versionLabels);
keepLock - whether to keep the document checked out after checkin operation
versionLabels - label(s) (in addition to the built-in ones which are configured elsewhere)
Same version (requires WRITE privileges):
IDfSysObject doc = (IDfSysObject) session.getObject(new DfId("0900000000000000"));
doc.fetch(null);
doc.setFile("C:\\temp\\temp.jpg"); // again, assuming the worst ;)
doc.save();
Note that fetch(null) is needed to make sure you have the most current version of the document at hand.
For both examples above the content file is replaced without any further magic. Be sure to rename the document as desired, and set the correct format if necessary, e.g.:
doc.setObjectName("new_name");
doc.setContentType("new_format");
public void checkinDoc(String objectId) throws Exception {
    sysObject = (IDfSysObject) idfSession.getObjectByID(objectId);
    // sysObject = (IDfSysObject) idfSession.getObjectByPath("/Cabinet/Folder/Document");
    if (sysObject.isCheckedOut()) { // if it is checked out
        sysObject.checkin(false, "CURRENT");
    }
}
Use setFile on the checked out document, then checkin.