I am using a Word template with Excel graphs which I want to programmatically manipulate with the Java Apache POI library. For this I also need to be able to conditionally delete a Chart which is stored in this template.
Based on Axel Richters post (Removing chart from PowerPoint slide with Apache POI) I think I am almost there, but when I want to open the updated Word file it gives an error about unreadable content. This is what I have thus far:
PackagePart packagePartChart = xWPFChart.getPackagePart();
PackagePart packagePartWordDoc = xWPFDocument.getPackagePart();
OPCPackage packageWordDoc = packagePartWordDoc.getPackage();
// iterate over all relations the chart has and remove them
for (PackageRelationship chartrelship : packagePartChart.getRelationships()) {
String partname = chartrelship.getTargetURI().toString();
PackagePart part = packageWordDoc.getPartsByName(Pattern.compile(partname)).get(0);
packageWordDoc.removePart(part);
packagePartChart.removeRelationship(chartrelship.getId());
}
// now remove the chart itself from the word doc
Method removeRelation = POIXMLDocumentPart.class.getDeclaredMethod("removeRelation", POIXMLDocumentPart.class);
removeRelation.setAccessible(true);
removeRelation.invoke(xWPFDocument, xWPFChart);
If I unzip the Word file I correctly see that:
the relation between the WordDoc and the Chart are deleted in '\word\ _rels\document.xml.rels'
the chart itself is deleted in folder '\word\charts'
the relations between the documents supporting the Chart itself are deleted in folder '\word\charts\' _rels
the related chart items themselves are deleted:
StyleN / ColorsN in folder '\word\charts' and
Microsoft_Excel_WorksheetN in folder '\word\embeddings'
Anybody any idea on what could be going wrong here?
Indeed the finding of the right paragraph holding the chart was a challenge. In the end, for simplicity sake, I added a one row/one column placeholder table with one cell with a text, let's say 'targetWordString' in it, DIRECTLY before the chart. With the below function I am location the BodyElementID of this table:
private Integer iBodyElementIterator (XWPFDocument wordDoc,String targetWordString) {
Iterator<IBodyElement> iter = wordDoc.getBodyElementsIterator();
Integer bodyElementID = null;
while (iter.hasNext()) {
IBodyElement elem = iter.next();
bodyElementID = wordDoc.getBodyElements().indexOf(elem);
if (elem instanceof XWPFParagraph) {
XWPFParagraph paragraph = (XWPFParagraph) elem;
for (XWPFRun runText : paragraph.getRuns()) {
String text = runText.getText(0);
Core.getLogger("WordExporter").trace("Body Element ID: " + bodyElementID + " Text: " + text);
if (text != null && text.equals(targetWordString)) {
break;
}
}
} else if (elem instanceof XWPFTable) {
if (((XWPFTable) elem).getRow(0) != null && ((XWPFTable) elem).getRow(0).getCell(0) != null) {
// the first cell holds the name via the template
String tableTitle = ((XWPFTable) elem).getRow(0).getCell(0).getText();
if (tableTitle.equals(targetWordString)) {
break;
}
Core.getLogger("WordExporter").trace("Body Element ID: " + bodyElementID + " Text: " + tableTitle);
} else {
Core.getLogger("WordExporter").trace("Body Element ID: " + bodyElementID + " Table removed!");
}
}
else {
Core.getLogger("WordExporter").trace("Body Element ID: " + bodyElementID + " Text: ?");
}
}
return bodyElementID;
}
In the main part of the code I am calling this function to locate the table, then first delete the chart (ID +1) and then the table (ID)
int elementIDToBeRemoved = iBodyElementIterator(xWPFWordDoc,targetWordString);
xWPFWordDoc.removeBodyElement(elementIDToBeRemoved + 1);
xWPFWordDoc.removeBodyElement(elementIDToBeRemoved);
It needs to be in this order, because the ID's are reordered once you delete a number in between, hence deleting the table first, would mean the chart would get that ID in principle.
I did find a more structural solution for it without the use of dummy tables for the positioning. The solution is build out into 2 parts:
Determine relationship id of chart in word document
Remove chart based on relationship id and remove paragraph + runs
Determine relationship id of chart in word document
First I have determined with a boolean variable 'chartUsed' whether the chart needs to be deleted.
if (!chartUsed) {
String chartPartName = xWPFChart.getPackagePart().getPartName().getName();
String ridChartToBeRemoved = null;
// iterate over all relations the chart has and remove them
for (RelationPart relation : wordDoc.getRelationParts()) {
PackageRelationship relShip = relation.getRelationship();
if (relShip.getTargetURI().toString().equals(chartPartName)) {
ridChartToBeRemoved = relShip.getId();
Core.getLogger(logNode).info("Chart with title " + chartTitle + " has relationship id " + relShip.getId());
break;
}
}
removeChartByRelId(wordDoc, ridChartToBeRemoved);
}
(Core.getLogger is an internal API from the Mendix low code platform I use).
At the end I am calling the function removeChartByRelId which will remove the chart if the relation is found within a run of a paragraph:
Remove chart based on relationship id and remove paragraph + runs
private void removeChartByRelId (XWPFDocument wordDoc, String chartRelId) {
Iterator<IBodyElement> iter = wordDoc.getBodyElementsIterator();
Integer bodyElementID = null;
while (iter.hasNext()) {
IBodyElement elem = iter.next();
bodyElementID = wordDoc.getBodyElements().indexOf(elem);
if (elem instanceof XWPFParagraph) {
Core.getLogger(logNode).trace("XWPFParagraph Body Element ID: " + bodyElementID);
XWPFParagraph paragraph = (XWPFParagraph) elem;
boolean removeParagraph = false;
for (XWPFRun run : paragraph.getRuns()) {
if (run.getCTR().getDrawingList().size()>0) {
String drawingXMLText = run.getCTR().getDrawingList().get(0).xmlText();
boolean isChart = drawingXMLText.contains("http://schemas.openxmlformats.org/drawingml/2006/chart");
String graphicDataXMLText = run.getCTR().getDrawingList().get(0).getInlineArray(0).getGraphic().getGraphicData().xmlText();
boolean contains = graphicDataXMLText.contains(chartRelId);
if (isChart && contains) {
Core.getLogger(logNode).info("Removing chart with relId " + chartRelId + " as it isnt used in data");
run.getCTR().getDrawingList().clear();
removeParagraph = true;
}
}
}
if (removeParagraph) {
for (int i = 0; i < paragraph.getRuns().size(); i++){
paragraph.removeRun(i);
}
}
}
}
}
Related
I'm trying to parse docx file that contains content control fields (that are added using window like this, reference image, mine is on another language)
I'm using library APACHE POI. I found this question on how to do it. I used the same code:
import java.io.FileInputStream;
import org.apache.poi.xwpf.usermodel.*;
import java.util.List;
import java.util.ArrayList;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.*;
import org.apache.xmlbeans.XmlCursor;
import javax.xml.namespace.QName;
public class ReadWordForm {
private static List<XWPFSDT> extractSDTsFromBody(XWPFDocument document) {
XWPFSDT sdt;
XmlCursor xmlcursor = document.getDocument().getBody().newCursor();
QName qnameSdt = new QName("http://schemas.openxmlformats.org/wordprocessingml/2006/main", "sdt", "w");
List<XWPFSDT> allsdts = new ArrayList<XWPFSDT>();
while (xmlcursor.hasNextToken()) {
XmlCursor.TokenType tokentype = xmlcursor.toNextToken();
if (tokentype.isStart()) {
if (qnameSdt.equals(xmlcursor.getName())) {
if (xmlcursor.getObject() instanceof CTSdtRun) {
sdt = new XWPFSDT((CTSdtRun)xmlcursor.getObject(), document);
//System.out.println("block: " + sdt);
allsdts.add(sdt);
} else if (xmlcursor.getObject() instanceof CTSdtBlock) {
sdt = new XWPFSDT((CTSdtBlock)xmlcursor.getObject(), document);
//System.out.println("inline: " + sdt);
allsdts.add(sdt);
}
}
}
}
return allsdts;
}
public static void main(String[] args) throws Exception {
XWPFDocument document = new XWPFDocument(new FileInputStream("WordDataCollectingForm.docx"));
List<XWPFSDT> allsdts = extractSDTsFromBody(document);
for (XWPFSDT sdt : allsdts) {
//System.out.println(sdt);
String title = sdt.getTitle();
String content = sdt.getContent().getText();
if (!(title == null) && !(title.isEmpty())) {
System.out.println(title + ": " + content);
} else {
System.out.println("====sdt without title====");
}
}
document.close();
}
}
The problem is that this code doesn't see these fields in the my docx file until I open it in LibreOffice and re-save it. So if the file is from Windows being put into this code it doesn't see these content control fields. But if I re-save the file in the LibreOffice (using the same format) it starts to see these fields, even tho it loses some of the data (titles and tags of some fields). Can someone tell me what might be the reason of it, how do I fix that so it will see these fields? Or there's an easier way using docx4j maybe? Unfortunately there's not much info about how to do it using these 2 libs in the internet, at least I didn't find it.
Examle files are located on google disk. The first one doesn't work, the second one works (after it was opened in Libre and field was changed to one of the options).
According to your uploaded sample files your content controls are in a table. The code you had found only gets content controls from document body directly.
Tables are beastly things in Word as table cells may contain whole document bodies each. That's why content controls in table cells are strictly separated from content controls in main document body. Their ooxml class is CTSdtCell instead of CTSdtRun or CTSdtBlock and in apache poi their class is XWPFSDTCell instead of XWPFSDT.
If it is only about reading the content, then one could fall back to XWPFAbstractSDT which is the abstract parent class of XWPFSDTCell as well as of XWPFSDT. So following code should work:
private static List<XWPFAbstractSDT> extractSDTsFromBody(XWPFDocument document) {
XWPFAbstractSDT sdt;
XmlCursor xmlcursor = document.getDocument().getBody().newCursor();
QName qnameSdt = new QName("http://schemas.openxmlformats.org/wordprocessingml/2006/main", "sdt", "w");
List<XWPFAbstractSDT> allsdts = new ArrayList<XWPFAbstractSDT>();
while (xmlcursor.hasNextToken()) {
XmlCursor.TokenType tokentype = xmlcursor.toNextToken();
if (tokentype.isStart()) {
if (qnameSdt.equals(xmlcursor.getName())) {
//System.out.println(xmlcursor.getObject().getClass().getName());
if (xmlcursor.getObject() instanceof CTSdtRun) {
sdt = new XWPFSDT((CTSdtRun)xmlcursor.getObject(), document);
//System.out.println("block: " + sdt);
allsdts.add(sdt);
} else if (xmlcursor.getObject() instanceof CTSdtBlock) {
sdt = new XWPFSDT((CTSdtBlock)xmlcursor.getObject(), document);
//System.out.println("inline: " + sdt);
allsdts.add(sdt);
} else if (xmlcursor.getObject() instanceof CTSdtCell) {
sdt = new XWPFSDTCell((CTSdtCell)xmlcursor.getObject(), null, null);
//System.out.println("cell: " + sdt);
allsdts.add(sdt);
}
}
}
}
return allsdts;
}
But as you see in code line sdt = new XWPFSDTCell((CTSdtCell)xmlcursor.getObject(), null, null), the XWPFSDTCell totaly lost its connection to table and tablerow.
There is not a proper method to get the XWPFSDTCell directly from a XWPFTable. So If one would need to get XWPFSDTCell connected to its table, then also parsing the XML is needed. This could look like so:
private static List<XWPFSDTCell> extractSDTsFromTableRow(XWPFTableRow row) {
XWPFSDTCell sdt;
XmlCursor xmlcursor = row.getCtRow().newCursor();
QName qnameSdt = new QName("http://schemas.openxmlformats.org/wordprocessingml/2006/main", "sdt", "w");
QName qnameTr = new QName("http://schemas.openxmlformats.org/wordprocessingml/2006/main", "tr", "w");
List<XWPFSDTCell> allsdts = new ArrayList<XWPFSDTCell>();
while (xmlcursor.hasNextToken()) {
XmlCursor.TokenType tokentype = xmlcursor.toNextToken();
if (tokentype.isStart()) {
if (qnameSdt.equals(xmlcursor.getName())) {
//System.out.println(xmlcursor.getObject().getClass().getName());
if (xmlcursor.getObject() instanceof CTSdtCell) {
sdt = new XWPFSDTCell((CTSdtCell)xmlcursor.getObject(), row, row.getTable().getBody());
//System.out.println("cell: " + sdt);
allsdts.add(sdt);
}
}
} else if (tokentype.isEnd()) {
//we have to check whether we are at the end of the table row
xmlcursor.push();
xmlcursor.toParent();
if (qnameTr.equals(xmlcursor.getName())) {
break;
}
xmlcursor.pop();
}
}
return allsdts;
}
And called from document like so:
...
for (XWPFTable table : document.getTables()) {
for (XWPFTableRow row : table.getRows()) {
List<XWPFSDTCell> allTrsdts = extractSDTsFromTableRow(row);
for (XWPFSDTCell sdt : allTrsdts) {
//System.out.println(sdt);
String title = sdt.getTitle();
String content = sdt.getContent().getText();
if (!(title == null) && !(title.isEmpty())) {
System.out.println(title + ": " + content);
} else {
System.out.println("====sdt without title====");
System.out.println(content);
}
}
}
}
...
Using curren apache poi 5.2.0 it is possible getting the XWPFSDTCell from XWPFTableRow via XWPFTableRow.getTableICells. This gets al List of ICells which is an interface which XWPFSDTCell also implemets.
So following code will get all XWPFSDTCell from tables without the need of low level XML parsing:
...
for (XWPFTable table : document.getTables()) {
for (XWPFTableRow row : table.getRows()) {
for (ICell iCell : row.getTableICells()) {
if (iCell instanceof XWPFSDTCell) {
XWPFSDTCell sdt = (XWPFSDTCell)iCell;
//System.out.println(sdt);
String title = sdt.getTitle();
String content = sdt.getContent().getText();
if (!(title == null) && !(title.isEmpty())) {
System.out.println(title + ": " + content);
} else {
System.out.println("====sdt without title====");
System.out.println(content);
}
}
}
}
}
...
_ Hi , this is my web page :
<html>
<head>
</head>
<body>
<div> text div 1</div>
<div>
<span>text of first span </span>
<span>text of second span </span>
</div>
<div> text div 3 </div>
</body>
</html>
I'm using jsoup to parse it , and then browse all elements inside the page and get their paths :
Document doc = Jsoup.parse(new File("C:\\Users\\HC\\Desktop\\dataset\\index.html"), "UTF-8");
Elements elements = doc.body().select("*");
ArrayList all = new ArrayList();
for (Element element : elements) {
if (!element.ownText().isEmpty()) {
StringBuilder path = new StringBuilder(element.nodeName());
String value = element.ownText();
Elements p_el = element.parents();
for (Element el : p_el) {
path.insert(0, el.nodeName() + '/');
}
all.add(path + " = " + value + "\n");
System.out.println(path +" = "+ value);
}
}
return all;
my code give me this result :
html/body/div = text div 1
html/body/div/span = text of first span
html/body/div/span = text of second span
html/body/div = text div 3
in fact i want get result like this :
html/body/div[1] = text div 1
html/body/div[2]/span[1] = text of first span
html/body/div[2]/span[2] = text of second span
html/body/div[3] = text div 3
please could any one give me idea how to get reach this result :) . thanks in advance.
As asked here a idea.
Even if I'm quite sure that there better solutions to get the xpath for a given node. For example use xslt as in the answer to "Generate/get xpath from XML node java".
Here the possible solution based on your current attempt.
For each (parent) element check if there are more than one element with this name.
Pseudo code: if ( count (el.select('../' + el.nodeName() ) > 1)
If true count the preceding-sibling:: with same name and add 1.
count (el.select('preceding-sibling::' + el.nodeName() ) +1
This is my solution to this problem:
StringBuilder absPath=new StringBuilder();
Elements parents = htmlElement.parents();
for (int j = parents.size()-1; j >= 0; j--) {
Element element = parents.get(j);
absPath.append("/");
absPath.append(element.tagName());
absPath.append("[");
absPath.append(element.siblingIndex());
absPath.append("]");
}
This would be easier, if you traversed the document from the root to the leafs instead of the other way round. This way you can easily group the elements by tag-name and handle multiple occurences accordingly. Here is a recursive approach:
private final List<String> path = new ArrayList<>();
private final List<String> all = new ArrayList<>();
public List<String> getAll() {
return Collections.unmodifiableList(all);
}
public void parse(Document doc) {
path.clear();
all.clear();
parse(doc.children());
}
private void parse(List<Element> elements) {
if (elements.isEmpty()) {
return;
}
Map<String, List<Element>> grouped = elements.stream().collect(Collectors.groupingBy(Element::tagName));
for (Map.Entry<String, List<Element>> entry : grouped.entrySet()) {
List<Element> list = entry.getValue();
String key = entry.getKey();
if (list.size() > 1) {
int index = 1;
// use paths with index
key += "[";
for (Element e : list) {
path.add(key + (index++) + "]");
handleElement(e);
path.remove(path.size() - 1);
}
} else {
// use paths without index
path.add(key);
handleElement(list.get(0));
path.remove(path.size() - 1);
}
}
}
private void handleElement(Element e) {
String value = e.ownText();
if (!value.isEmpty()) {
// add entry
all.add(path.stream().collect(Collectors.joining("/")) + " = " + value);
}
// process children of element
parse(e.children());
}
Here is the solution in Kotlin. It's correct, and it works. The other answers are wrong and caused me hours of lost work.
fun Element.xpath(): String = buildString {
val parents = parents()
for (j in (parents.size - 1) downTo 0) {
val parent = parents[j]
append("/*[")
append(parent.siblingIndex() + 1)
append(']')
}
append("/*[")
append(siblingIndex() + 1)
append(']')
}
I am using docx4j to deal with word document formatting. I have one word document which is divided in number of tables. I want to read all the tables and if I find some keywords then I want to take those contents to another word document with all the formatting. My word document is as follow.
Like from above I want to take content which is below Some Title. Here my keyword is Sample Text. So whenever Sample Text gets repeated, content needs to be fetched to new word document.
I am using following code.
MainDocumentPart mainDocumentPart = null;
WordprocessingMLPackage docxFile = WordprocessingMLPackage.load(new File(fileName));
mainDocumentPart = docxFile.getMainDocumentPart();
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();
ClassFinder finder = new ClassFinder(Tbl.class);
new TraversalUtil(mainDocumentPart.getContent(), finder);
Tbl tbl = null;
int noTbls = 0;
int noRows = 0;
int noCells = 0;
int noParas = 0;
int noTexts = 0;
for (Object table : finder.results) {
noTbls++;
tbl = (Tbl) table;
// Get all the Rows in the table
List<Object> allRows = DocxUtility.getDocxUtility()
.getAllElementFromObject(tbl, Tr.class);
for (Object row : allRows) {
Tr tr = (Tr) row;
noRows++;
// Get all the Cells in the Row
List<Object> allCells = DocxUtility.getDocxUtility()
.getAllElementFromObject(tr, Tc.class);
toCell:
for (Object cell : allCells) {
Tc tc = (Tc) cell;
noCells++;
// Get all the Paragraph's in the Cell
List<Object> allParas = DocxUtility.getDocxUtility()
.getAllElementFromObject(tc, P.class);
for (Object para : allParas) {
P p = (P) para;
noParas++;
// Get all the Run's in the Paragraph
List<Object> allRuns = DocxUtility.getDocxUtility()
.getAllElementFromObject(p, R.class);
for (Object run : allRuns) {
R r = (R) run;
// Get the Text in the Run
List<Object> allText = DocxUtility.getDocxUtility()
.getAllElementFromObject(r, Text.class);
for (Object text : allText) {
noTexts++;
Text txt = (Text) text;
}
System.out.println("No of Text in Para No: " + noParas + "are: " + noTexts);
}
}
System.out.println("No of Paras in Cell No: " + noCells + "are: " + noParas);
}
System.out.println("No of Cells in Row No: " + noRows + "are: " + noCells);
}
System.out.println("No of Rows in Table No: " + noTbls + "are: " + noRows);
}
System.out.println("Total no of Tables: " + noTbls );
Assuming your text is in a single run (ie not split across runs), then you can search for it via XPath. Or you can manually traverse using TraversalUtil. See docx4j's Getting Started for more info.
So finding your stuff is pretty easy. Copying the formatting it uses, and any rels in it, is in the general case, complicated. See my post http://www.docx4java.org/blog/2010/11/merging-word-documents/ for more on the issues involved.
I'm using the SWT OLE api to edit a Word document in an Eclipse RCP. I read articles about how to read properties from the active document but now I'm facing a problem with collections like sections.
I would like to retrieve only the body section of my document but I don't know what to do with my sections object which is an IDispatch object. I read that the item method should be used but I don't understand how.
I found the solution so I'll share it with you :)
Here is a sample code to list all paragraphs of the active document of the word editor :
OleAutomation active = activeDocument.getAutomation();
if(active!=null){
int[] paragraphsId = getId(active, "Paragraphs");
if(paragraphsId.length > 0) {
Variant vParagraphs = active.getProperty(paragraphsId[0]);
if(vParagraphs != null){
OleAutomation paragraphs = vParagraphs.getAutomation();
if(paragraphs!=null){
int[] countId = getId(paragraphs, "Count");
if(countId.length > 0) {
Variant count = paragraphs.getProperty(countId[0]);
if(count!=null){
int numberOfParagraphs = count.getInt();
for(int i = 1 ; i <= numberOfParagraphs ; i++) {
Variant paragraph = paragraphs.invoke(0, new Variant[]{new Variant(i)});
if(paragraph!=null){
System.out.println("paragraph " + i + " added to list!");
listOfParagraphs.add(paragraph);
}
}
return listOfParagraphs;
}
}
}
}
}
I use OpenCmis in-memory for testing. But when I create a document I am not allowed to set the versioningState to something else then versioningState.NONE.
The doc created is not versionable some way... I used the code from http://chemistry.apache.org/java/examples/example-create-update.html
The test method:
public void test() {
String filename = "test123";
Folder folder = this.session.getRootFolder();
// Create a doc
Map<String, Object> properties = new HashMap<String, Object>();
properties.put(PropertyIds.OBJECT_TYPE_ID, "cmis:document");
properties.put(PropertyIds.NAME, filename);
String docText = "This is a sample document";
byte[] content = docText.getBytes();
InputStream stream = new ByteArrayInputStream(content);
ContentStream contentStream = this.session.getObjectFactory().createContentStream(filename, Long.valueOf(content.length), "text/plain", stream);
Document doc = folder.createDocument(
properties,
contentStream,
VersioningState.MAJOR);
}
The exception I get:
org.apache.chemistry.opencmis.commons.exceptions.CmisConstraintException: The versioning state flag is imcompatible to the type definition.
What am I missing?
I found the reason...
By executing the following code I discovered that the OBJECT_TYPE_ID 'cmis:document' don't allow versioning.
Code to view all available OBJECT_TYPE_ID's (source):
boolean includePropertyDefintions = true;
for (t in session.getTypeDescendants(
null, // start at the top of the tree
-1, // infinite depth recursion
includePropertyDefintions // include prop defs
)) {
printTypes(t, "");
}
static void printTypes(Tree tree, String tab) {
ObjectType objType = tree.getItem();
println(tab + "TYPE:" + objType.getDisplayName() +
" (" + objType.getDescription() + ")");
// Print some of the common attributes for this type
print(tab + " Id:" + objType.getId());
print(" Fileable:" + objType.isFileable());
print(" Queryable:" + objType.isQueryable());
if (objType instanceof DocumentType) {
print(" [DOC Attrs->] Versionable:" +
((DocumentType)objType).isVersionable());
print(" Content:" +
((DocumentType)objType).getContentStreamAllowed());
}
println(""); // end the line
for (t in tree.getChildren()) {
// there are more - call self for next level
printTypes(t, tab + " ");
}
}
This resulted in a list like this:
TYPE:CMIS Folder (Description of CMIS Folder Type) Id:cmis:folder
Fileable:true Queryable:true
TYPE:CMIS Document (Description of CMIS Document Type)
Id:cmis:document Fileable:true Queryable:true [DOC Attrs->]
Versionable:false Content:ALLOWED
TYPE:My Type 1 Level 1 (Description of My Type 1 Level 1 Type)
Id:MyDocType1 Fileable:true Queryable:true [DOC Attrs->]
Versionable:false Content:ALLOWED
TYPE:VersionedType (Description of VersionedType Type)
Id:VersionableType Fileable:true Queryable:true [DOC Attrs->]
Versionable:true Content:ALLOWED
As you can see the last OBJECT_TYPE_ID has versionable: true... and when I use that it does work.