How to add an altChunk element to an XWPFDocument using Apache POI - java

I would like to add HTML as an altChunk to a DOCX file using Apache POI. I know that docx4j can do this with a simpler API, but for technical reasons I need to use Apache POI.
Using the CT classes to do low-level work with the XML is a little tricky. I can create an altChunk with the following code:
import java.io.File;
import java.io.FileOutputStream;
import javax.xml.namespace.QName;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
import org.apache.xmlbeans.impl.values.XmlComplexContentImpl;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTDocument1;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.impl.CTBodyImpl;

public class AltChunkTest {
    public static void main(String[] args) throws Exception {
        XWPFDocument doc = new XWPFDocument();
        doc.createParagraph().createRun().setText("AltChunk below:");
        // altChunk lives in the WordprocessingML main namespace
        QName ALTCHUNK = new QName("http://schemas.openxmlformats.org/wordprocessingml/2006/main", "altChunk");
        CTDocument1 ctDoc = doc.getDocument();
        CTBodyImpl ctBody = (CTBodyImpl) ctDoc.getBody();
        XmlComplexContentImpl xcci = (XmlComplexContentImpl) ctBody.get_store().add_element_user(ALTCHUNK);
        // what is needed now to add "<b>Hello World!</b>"?
        FileOutputStream out = new FileOutputStream(new File("test.docx"));
        doc.write(out);
        out.close();
    }
}
But how do I now add the HTML content to 'xcci'?

In Office Open XML for Word (*.docx) the altChunk provides a method for using pure HTML to describe document parts.
Two important notes about altChunk:
First: It is used only for importing content. If you open the document using Word and save it, the newly saved document will not contain the alternative format content part, nor the altChunk markup that references it. Word saves all imported content as default Office Open XML elements.
Second: Most applications other than Word that can read *.docx files will not read the altChunk content at all. For example, LibreOffice and OpenOffice Writer ignore the altChunk content, and Apache POI does not read it either when opening a *.docx file.
How is altChunk implemented in the *.docx ZIP file structure?
There are /word/*.html files in the *.docx ZIP file. Those are referenced by Id in /word/document.xml, for example as <w:altChunk r:id="htmlDoc1"/>. The relation between the Ids and the /word/*.html files is given in /word/_rels/document.xml.rels, for example as <Relationship Id="htmlDoc1" Target="htmlDoc1.html" Type=""/>.
So first we need POIXMLDocumentParts for the /word/*.html files and POIXMLRelations for the relation between the Ids and the /word/*.html files. The following code provides that with a wrapper class which extends POIXMLDocumentPart for the /word/htmlDoc#.html files in the *.docx ZIP archive. The wrapper also provides methods for manipulating the HTML. A separate method creates the /word/htmlDoc#.html files in the *.docx ZIP archive and the relations to them.
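To make the three pieces concrete, here is a minimal sketch that uses only the JDK. It writes the three parts into an in-memory ZIP and lists them. The class and method names are made up for illustration, the XML strings are fragments rather than complete parts, and the result is not a valid *.docx. The relationship type shown (ending in aFChunk) is, as far as I know, the one the Office Open XML specification defines for altChunk parts; treat it as an assumption if your tooling reports something else.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class AltChunkLayoutSketch {

    // Assumed relationship type for altChunk parts.
    static final String AF_CHUNK_TYPE =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk";

    /** The three package parts that tie one HTML chunk to the main document (fragments only). */
    static Map<String, String> altChunkParts(String id, String html) {
        Map<String, String> parts = new LinkedHashMap<>();
        parts.put("word/document.xml", "<w:altChunk r:id=\"" + id + "\"/>");
        parts.put("word/" + id + ".html", html);
        parts.put("word/_rels/document.xml.rels",
            "<Relationship Id=\"" + id + "\" Target=\"" + id + ".html\" Type=\"" + AF_CHUNK_TYPE + "\"/>");
        return parts;
    }

    /** Writes the parts into an in-memory ZIP archive, in map order. */
    static byte[] zip(Map<String, String> parts) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            for (Map.Entry<String, String> part : parts.entrySet()) {
                zos.putNextEntry(new ZipEntry(part.getKey()));
                zos.write(part.getValue().getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Lists the entry names of a ZIP archive in stored order. */
    static List<String> entryNames(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            for (ZipEntry e; (e = zis.getNextEntry()) != null; ) {
                names.add(e.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        Map<String, String> parts =
            altChunkParts("htmlDoc1", "<html><body><b>Hello World!</b></body></html>");
        entryNames(zip(parts)).forEach(System.out::println);
    }
}
```

A real package also needs [Content_Types].xml, /_rels/.rels and a complete document.xml, which is what XWPFDocument takes care of.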
import java.io.*;
import org.apache.poi.*;
import org.apache.poi.ooxml.*;
import org.apache.poi.openxml4j.opc.*;
import org.apache.poi.xwpf.usermodel.*;

public class CreateWordWithHTMLaltChunk {

    //a method for creating the htmlDoc /word/htmlDoc#.html in the *.docx ZIP archive
    //String id will be htmlDoc#.
    private static MyXWPFHtmlDocument createHtmlDoc(XWPFDocument document, String id) throws Exception {
        OPCPackage oPCPackage = document.getPackage();
        PackagePartName partName = PackagingURIHelper.createPartName("/word/" + id + ".html");
        PackagePart part = oPCPackage.createPart(partName, "text/html");
        MyXWPFHtmlDocument myXWPFHtmlDocument = new MyXWPFHtmlDocument(part, id);
        document.addRelation(myXWPFHtmlDocument.getId(), new XWPFHtmlRelation(), myXWPFHtmlDocument);
        return myXWPFHtmlDocument;
    }

    public static void main(String[] args) throws Exception {
        XWPFDocument document = new XWPFDocument();
        XWPFParagraph paragraph;
        XWPFRun run;
        MyXWPFHtmlDocument myXWPFHtmlDocument;

        paragraph = document.createParagraph();
        run = paragraph.createRun();
        run.setText("Default paragraph followed by first HTML chunk.");
        myXWPFHtmlDocument = createHtmlDoc(document, "htmlDoc1");
        myXWPFHtmlDocument.setHtml(myXWPFHtmlDocument.getHtml().replace("<body></body>",
            "<body><p>Simple <b>HTML</b> <i>formatted</i> <u>text</u></p></body>"));
        //reference the HTML part from the document body
        document.getDocument().getBody().addNewAltChunk().setId(myXWPFHtmlDocument.getId());

        paragraph = document.createParagraph();
        run = paragraph.createRun();
        run.setText("Default paragraph followed by second HTML chunk.");
        myXWPFHtmlDocument = createHtmlDoc(document, "htmlDoc2");
        myXWPFHtmlDocument.setHtml(myXWPFHtmlDocument.getHtml().replace("<body></body>",
            "<body>" +
            "<table>" +
            "<caption>A table</caption>" +
            "<tr><th>Name</th><th>Date</th><th>Amount</th></tr>" +
            "<tr><td>John Doe</td><td>2018-12-01</td><td>1,234.56</td></tr>" +
            "</table>" +
            "</body>"));
        document.getDocument().getBody().addNewAltChunk().setId(myXWPFHtmlDocument.getId());

        FileOutputStream out = new FileOutputStream("CreateWordWithHTMLaltChunk.docx");
        document.write(out);
        out.close();
        document.close();
    }

    //a wrapper class for the htmlDoc /word/htmlDoc#.html in the *.docx ZIP archive
    //provides methods for manipulating the HTML
    //TODO: We should *not* use String methods for manipulating HTML!
    private static class MyXWPFHtmlDocument extends POIXMLDocumentPart {
        private String html;
        private String id;

        private MyXWPFHtmlDocument(PackagePart part, String id) throws Exception {
            super(part);
            this.html = "<!DOCTYPE html><html><head><meta http-equiv=\"content-type\" content=\"text/html; charset=utf-8\"><style></style><title>HTML import</title></head><body></body>";
            this.id = id;
        }

        private String getId() {
            return id;
        }

        private String getHtml() {
            return html;
        }

        private void setHtml(String html) {
            this.html = html;
        }

        //called by POI while the document is written
        @Override
        protected void commit() throws IOException {
            PackagePart part = getPackagePart();
            OutputStream out = part.getOutputStream();
            Writer writer = new OutputStreamWriter(out, "UTF-8");
            writer.write(html);
            writer.close();
            out.close();
        }
    }

    //the XWPFRelation for /word/htmlDoc#.html
    private final static class XWPFHtmlRelation extends POIXMLRelation {
        private XWPFHtmlRelation() {
            super(
                "text/html",
                "http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk",
                "/word/htmlDoc#.html");
        }
    }
}
Note: Because it uses altChunk, this code needs the full jar of all of the schemas, ooxml-schemas-*.jar, as mentioned in the Apache POI FAQ (faq-N10025).

Based on Axel Richter's answer, I replaced the call to CTBody.addNewAltChunk() with CTBodyImpl.get_store().add_element_user(QName) which eliminates the added 15MB dependency on ooxml-schemas. Since this is being used in a desktop app, we are trying to keep the app size as small as possible. In case it may be of help to anyone else:
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import javax.xml.namespace.QName;
import org.apache.poi.ooxml.POIXMLDocumentPart;
import org.apache.poi.ooxml.POIXMLRelation;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackagePart;
import org.apache.poi.openxml4j.opc.PackagePartName;
import org.apache.poi.openxml4j.opc.PackagingURIHelper;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.xmlbeans.SimpleValue;
import org.apache.xmlbeans.impl.values.XmlComplexContentImpl;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.impl.CTBodyImpl;

public class AltChunkTest {
    public static void main(String[] args) throws Exception {
        XWPFDocument doc = new XWPFDocument();
        doc.createParagraph().createRun().setText("AltChunk below:");
        addHtml(doc, "chunk1", "<!DOCTYPE html><html><head><style></style><title></title></head><body><b>Hello World!</b></body></html>");
        FileOutputStream out = new FileOutputStream(new File("test.docx"));
        doc.write(out);
        out.close();
    }

    static void addHtml(XWPFDocument doc, String id, String html) throws Exception {
        OPCPackage oPCPackage = doc.getPackage();
        PackagePartName partName = PackagingURIHelper.createPartName("/word/" + id + ".html");
        PackagePart part = oPCPackage.createPart(partName, "text/html");

        // relation between the document and the HTML part
        class HtmlRelation extends POIXMLRelation {
            private HtmlRelation() {
                super("text/html",
                      "http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk",
                      "/word/htmlDoc#.html");
            }
        }

        // document part that writes the HTML when the package is saved
        class HtmlDocumentPart extends POIXMLDocumentPart {
            private HtmlDocumentPart(PackagePart part) throws Exception {
                super(part);
            }

            @Override
            protected void commit() throws IOException {
                try (OutputStream out = part.getOutputStream()) {
                    try (Writer writer = new OutputStreamWriter(out, "UTF-8")) {
                        writer.write(html);
                    }
                }
            }
        }

        HtmlDocumentPart documentPart = new HtmlDocumentPart(part);
        doc.addRelation(id, new HtmlRelation(), documentPart);

        // add <w:altChunk r:id="..."/> without the typed CTBody.addNewAltChunk()
        CTBodyImpl b = (CTBodyImpl) doc.getDocument().getBody();
        QName ALTCHUNK = new QName("http://schemas.openxmlformats.org/wordprocessingml/2006/main", "altChunk");
        XmlComplexContentImpl altchunk = (XmlComplexContentImpl) b.get_store().add_element_user(ALTCHUNK);
        QName ID = new QName("http://schemas.openxmlformats.org/officeDocument/2006/relationships", "id");
        SimpleValue target = (SimpleValue) altchunk.get_store().add_attribute_user(ID);
        target.setStringValue(id);
    }
}

This works in poi-ooxml 4.0.0, where the classes POIXMLDocumentPart and POIXMLRelation are in the package org.apache.poi.ooxml:
import org.apache.poi.ooxml.POIXMLDocumentPart;
import org.apache.poi.ooxml.POIXMLRelation;
But how can we use it in poi-ooxml 3.9, where the classes are a little different and live in the package org.apache.poi?
import org.apache.poi.POIXMLDocumentPart;
import org.apache.poi.POIXMLRelation;


Apache POI changes the file MIME type. Is it possible to fix it?

I'm having problems with Apache POI and the file MIME type.
I use a template file (Microsoft Word DOCX) and modify some values in it with Apache POI.
The original file has the MIME type "application/vnd.openxmlformats-officedocument.wordprocessingml.document" (on Linux: file -i {filename}), but after I process the file with POI and save it again I get "application/octet-stream". I would like to keep the original MIME type.
I opened both files, original and modified, in a hex editor. Both have the same "magic numbers" (50 4B 03 04), but the file size is different, even though the texts are the same.
So is it possible to fix this? Has anyone had the same problem? I checked with LibreOffice and it appears to behave the same way as Apache POI.
Any help or information is welcome.
As you already stated in a comment, the way Apache POI rearranges the Office Open XML ZIP package leads some tools to misinterpret the content type. An Office Open XML file (*.docx, *.xlsx, *.pptx) is a ZIP archive, but the way Microsoft Office packs that archive must be special somehow. I have not found out what exactly it is, though.
Start with a Document.docx that has some simple content and was saved by Microsoft Word.
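One way to narrow this down is to compare how the two archives are laid out. The sketch below uses only the JDK; the class and method names are made up for illustration. It checks the 50 4B 03 04 signature and prints the entry names in the order in which they are physically stored. That the entry order is what the file tool reacts to is only a guess on my part, not something confirmed in this thread.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class DocxZipInspector {

    /** True if the data starts with the ZIP local file header signature 50 4B 03 04. */
    static boolean hasZipMagic(byte[] data) {
        return data.length >= 4
            && data[0] == 0x50 && data[1] == 0x4B && data[2] == 0x03 && data[3] == 0x04;
    }

    /** Entry names in the order in which they are stored in the archive. */
    static List<String> entryOrder(byte[] data) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(data))) {
            for (ZipEntry e; (e = zis.getNextEntry()) != null; ) {
                names.add(e.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    /** Builds a small archive with the given entry names, for trying the methods without a real file. */
    static byte[] sampleZip(String... entryNames) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
            for (String name : entryNames) {
                zos.putNextEntry(new ZipEntry(name));
                zos.write('x');
                zos.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // usage: java DocxZipInspector Document.docx NewDocument.docx
        for (String fileName : args) {
            byte[] data = Files.readAllBytes(Paths.get(fileName));
            System.out.println(fileName + ": zip magic = " + hasZipMagic(data));
            entryOrder(data).forEach(name -> System.out.println("  " + name));
        }
    }
}
```

Running it on the Word-saved file and on the POI-saved file side by side shows whether the two packages start with different entries.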
For this, file -i produces:
axel@arichter:~/Dokumente/JAVA/poi/poi-4.0.1$ file -i Document.docx
Document.docx: application/vnd.openxmlformats-officedocument.wordprocessingml.document; charset=binary
Now run this code:
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class WordReadAndReWrite {
    public static void main(String[] args) throws Exception {
        String inFilePath = "Document.docx";
        String outFilePath = "NewDocument.docx";
        XWPFDocument doc = new XWPFDocument(new FileInputStream(inFilePath));
        doc.createParagraph().createRun().setText("new text inserted");
        FileOutputStream out = new FileOutputStream(outFilePath);
        doc.write(out);
        out.close();
        doc.close();
    }
}
For the resulting NewDocument.docx, file -i produces:
axel@arichter:~/Dokumente/JAVA/poi/poi-4.0.1$ file -i NewDocument.docx
NewDocument.docx: application/octet-stream; charset=binary
But if we do the same without Apache POI's ZipPackage and instead use a ZIP FileSystem to get the XML out of the Office Open XML ZIP package, with the following code:
import java.nio.file.Files;
import java.nio.file.FileSystems;
import java.nio.file.FileSystem;
import java.nio.file.Paths;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

public class WordReadAndReWriteFileSystem {
    public static void main(String[] args) throws Exception {
        String inFilePath = "Document.docx";
        String outFilePath = "NewDocument.docx";

        // read /word/document.xml directly out of the ZIP
        // (the cast avoids an ambiguous overload on newer JDKs)
        FileSystem fileSystem = FileSystems.newFileSystem(Paths.get(inFilePath), (ClassLoader) null);
        Path wordDocumentXml = fileSystem.getPath("/word/document.xml");
        DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document xmlDocument = documentBuilder.parse(Files.newInputStream(wordDocumentXml, StandardOpenOption.READ));
        fileSystem.close();

        // build <w:p><w:r><w:t>new text inserted</w:t></w:r></w:p>
        Node p = xmlDocument.createElement("w:p");
        Node r = xmlDocument.createElement("w:r");
        Node t = xmlDocument.createElement("w:t");
        Node text = xmlDocument.createTextNode("new text inserted");
        t.appendChild(text);
        r.appendChild(t);
        p.appendChild(r);

        // insert the paragraph before the section properties
        Node body = xmlDocument.getElementsByTagName("w:body").item(0);
        Node sectPr = xmlDocument.getElementsByTagName("w:sectPr").item(0);
        body.insertBefore(p, sectPr);

        // serialize the changed XML into a temporary file
        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Transformer transformer = transformerFactory.newTransformer();
        DOMSource domSource = new DOMSource(xmlDocument);
        Path tmpDoc = Files.createTempFile("wordDocument", "tmp");
        StreamResult streamResult = new StreamResult(Files.newOutputStream(tmpDoc, StandardOpenOption.WRITE));
        transformer.transform(domSource, streamResult);
        streamResult.getOutputStream().close();

        // copy the original ZIP and replace only /word/document.xml in the copy
        Path tmpZip = Files.createTempFile("zipDocument", "tmp");
        Path path = Files.copy(Paths.get(inFilePath), tmpZip, StandardCopyOption.REPLACE_EXISTING);
        fileSystem = FileSystems.newFileSystem(path, (ClassLoader) null);
        wordDocumentXml = fileSystem.getPath("/word/document.xml");
        Files.copy(tmpDoc, wordDocumentXml, StandardCopyOption.REPLACE_EXISTING);
        fileSystem.close();

        Files.copy(tmpZip, Paths.get(outFilePath), StandardCopyOption.REPLACE_EXISTING);
        Files.deleteIfExists(tmpDoc);
        Files.deleteIfExists(tmpZip);
    }
}
Then for the resulting NewDocument.docx, file -i produces:
axel@arichter:~/Dokumente/JAVA/poi/poi-4.0.1$ file -i NewDocument.docx
NewDocument.docx: application/vnd.openxmlformats-officedocument.wordprocessingml.document; charset=binary
This code shows the correct MIME type for all files that I tested:
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.Tika;

public class Main {
    public static void main(String[] args) {
        String fileName = "model_libreoffice.docx";
        // String fileName = "model_poi.docx";
        // String fileName = "model_msoffice.docx";
        // String fileName = "model_repacked_bz2.docx";
        try (InputStream is = Main.class.getResourceAsStream("/" + fileName)) {
            Tika t = new Tika();
            // detection uses both the stream content and the file name
            String mime = t.detect(is, fileName);
            System.out.println("----> " + mime);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
After long debugging and testing, I think this is a problem with how third-party tools validate the files.
This simple code shows me the correct MIME type for all files that I tried: files modified by Microsoft Office, LibreOffice and Apache POI, and files whose DOCX contents were unzipped and zipped again (then renamed to DOCX).
So I think this problem can be marked as "solved".

How to manipulate content of a comment with Apache POI

I would like to find a comment in a DOCX document (somehow, by author or ID...) and then give it new content. I was able to create a comment with the help of this answer, but had no luck with manipulating one.
As said in my answer linked in your question, XWPFDocument so far only reads that package part while the document is being created. There is neither write access nor a way to create that package part. This is mentioned in XWPFDocument's protected void onDocumentRead(), code line 210: "// TODO Create according XWPFComment class, extending POIXMLDocumentPart".
So for now we need to do this ourselves. We need to provide a class extending POIXMLDocumentPart for the comments and register it for the relation in place of the plain POIXMLDocumentPart. Changes made through that class are then committed when the XWPFDocument is written.
import java.io.*;
import org.apache.poi.*;
import org.apache.poi.openxml4j.opc.*;
import org.apache.xmlbeans.*;
import org.apache.poi.xwpf.usermodel.*;
import static org.apache.poi.POIXMLTypeLoader.DEFAULT_XML_OPTIONS;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.*;
import javax.xml.namespace.QName;
import java.math.BigInteger;
import java.util.GregorianCalendar;
import java.util.Locale;

public class WordChangeComments {

    public static void main(String[] args) throws Exception {
        XWPFDocument document = new XWPFDocument(new FileInputStream("WordDocumentHavingComments.docx"));

        for (POIXMLDocumentPart.RelationPart rpart : document.getRelationParts()) {
            String relation = rpart.getRelationship().getRelationshipType();
            if (relation.equals(XWPFRelation.COMMENT.getRelation())) {
                POIXMLDocumentPart part = rpart.getDocumentPart(); //this is only POIXMLDocumentPart, not a high level class extending POIXMLDocumentPart
                //provide class extending POIXMLDocumentPart for comments
                MyXWPFCommentsDocument myXWPFCommentsDocument = new MyXWPFCommentsDocument(part.getPackagePart());
                //and register this relation instead of only the relation to POIXMLDocumentPart
                String rId = document.getRelationId(part);
                document.addRelation(rId, XWPFRelation.COMMENT, myXWPFCommentsDocument);

                //now the comments are available from the new MyXWPFCommentsDocument
                for (CTComment ctComment : myXWPFCommentsDocument.getComments().getCommentArray()) {
                    System.out.print("Comment: Id: " + ctComment.getId());
                    System.out.print(", Author: " + ctComment.getAuthor());
                    System.out.print(", Date: " + ctComment.getDate());
                    System.out.print(", Text: ");
                    for (CTP ctp : ctComment.getPArray()) {
                        System.out.print(ctp.newCursor().getTextValue());
                    }
                    System.out.println();

                    //and changes can be made which are committed while writing the XWPFDocument
                    if (BigInteger.ONE.equals(ctComment.getId())) { //the second comment (Id 0 = first)
                        ctComment.setAuthor("New Author");
                        ctComment.setDate(new GregorianCalendar(Locale.US));
                        CTP newCTP = CTP.Factory.newInstance();
                        newCTP.addNewR().addNewT().setStringValue("The new Text for Comment with Id 1.");
                        ctComment.setPArray(new CTP[]{newCTP});
                    }
                }
            }
        }

        document.write(new FileOutputStream("WordDocumentHavingComments.docx"));
        document.close();
    }

    //a wrapper class for the CommentsDocument /word/comments.xml in the *.docx ZIP archive
    private static class MyXWPFCommentsDocument extends POIXMLDocumentPart {
        private CTComments comments;

        private MyXWPFCommentsDocument(PackagePart part) throws Exception {
            super(part);
            comments = CommentsDocument.Factory.parse(part.getInputStream(), DEFAULT_XML_OPTIONS).getComments();
        }

        private CTComments getComments() {
            return comments;
        }

        //called by POI while the document is written
        @Override
        protected void commit() throws IOException {
            System.out.println("============MyXWPFCommentsDocument is committed=================");
            XmlOptions xmlOptions = new XmlOptions(DEFAULT_XML_OPTIONS);
            xmlOptions.setSaveSyntheticDocumentElement(new QName(CTComments.type.getName().getNamespaceURI(), "comments"));
            PackagePart part = getPackagePart();
            OutputStream out = part.getOutputStream();
            comments.save(out, xmlOptions);
            out.close();
        }
    }
}
This works for Apache POI 3.17. Since Apache POI 4.0.0 the OOXML base classes live in a separate package, so the imports must be:
import org.apache.poi.ooxml.*;
import static org.apache.poi.ooxml.POIXMLTypeLoader.DEFAULT_XML_OPTIONS;

Convert DOCX to HTML including IMAGES

I am using docx4j to convert DOCX to HTML. The conversion works and I get the HTML. I will use that HTML as the body of an email, but I have the issues listed below:
Unable to display images in email body
Losing the spaces and bullets
Here is the code I have written:
WordprocessingMLPackage wordMLPackage;
wordMLPackage = Docx4J.load(new;
HTMLSettings htmlSettings = Docx4J.createHTMLSettings();
htmlSettings.setImageDirPath(imageFolder + resourcePath2 + "_files");
htmlSettings.setImageTargetUri(imageFolder +resourcePath2.substring(resourcePath2.lastIndexOf("/")+1) + "_files");
OutputStream os;
os = new ByteArrayOutputStream();
Docx4jProperties.setProperty("docx4j.Convert.Out.HTML.OutputMethodXML", true);
Docx4J.toHTML(htmlSettings, os, Docx4J.FLAG_SAVE_FLAT_XML);
DOCX = ((ByteArrayOutputStream)os).toString();
You can do something like this in your code:
package tcg.doc.web.managedBeans;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.poi.xwpf.converter.core.FileImageExtractor;
import org.apache.poi.xwpf.converter.core.FileURIResolver;
import org.apache.poi.xwpf.converter.xhtml.XHTMLConverter;
import org.apache.poi.xwpf.converter.xhtml.XHTMLOptions;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class ConvertWord {
    private static final String docName = "TestDocx.docx";
    private static final String outputlFolderPath = "d:/";
    String htmlNamePath = "docHtml.html";
    String zipName = "";
    File docFile = new File(outputlFolderPath + docName);
    File zipFile = new File(zipName);

    public void ConvertWordToHtml() {
        try {
            // 1) Load DOCX into XWPFDocument
            InputStream doc = new FileInputStream(docFile);
            XWPFDocument document = new XWPFDocument(doc);
            // 2) Prepare XHTML options
            XHTMLOptions options = XHTMLOptions.create();
            // Extract images into a folder named after the document
            String root = "target";
            File imageFolder = new File(root + "/images/" + docName);
            options.setExtractor(new FileImageExtractor(imageFolder));
            // URI resolver, so the HTML points at the extracted images
            options.URIResolver(new FileURIResolver(imageFolder));
            // 3) Convert and write the HTML file
            OutputStream out = new FileOutputStream(new File(htmlPath()));
            XHTMLConverter.getInstance().convert(document, out, options);
            out.close();
            System.out.println("HTML written to " + htmlPath());
        } catch (FileNotFoundException ex) {
            ex.printStackTrace();
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }

    public static void main(String[] args) {
        ConvertWord cwoWord = new ConvertWord();
        cwoWord.ConvertWordToHtml();
    }

    public String htmlPath() {
        // d:/docHtml.html
        return outputlFolderPath + htmlNamePath;
    }

    public String zipPath() {
        // d:/
        return outputlFolderPath + zipName;
    }
}
For maven Dependency on pom.xml
or download it from Here
For images to work in an email body, I guess you need to use either a data URI or publish them to a web-reachable location.
In either case, you'll need to write an implementation of:
public interface ConversionImageHandler {
    /**
     * @param picture
     * @param relationship of the image
     * @param part of the image, if it is an internal image, otherwise null
     * @return uri for the image we've saved, or null
     * @throws Docx4JException this exception will be logged, but not propagated
     */
    public String handleImage(AbstractWordXmlPicture picture, Relationship relationship, BinaryPart part) throws Docx4JException;
}
and configure docx4j to use it with htmlSettings.setImageHandler.
You can look at some of the existing implementations in the docx4j source code, and take advantage of the helper methods in AbstractConversionImageHandler (eg createEncodedImage if you want data URIs).
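To show what the data URI option amounts to, here is a minimal sketch that uses only the JDK and none of the docx4j classes. The class and method names are made up for illustration. An image handler would return such a string as the image URI. Not every mail client renders data URIs, so test this with the clients you care about.

```java
import java.util.Base64;

class DataUriSketch {

    /** Builds a data URI of the form data:<mime type>;base64,<payload>. */
    static String toDataUri(String mimeType, byte[] imageBytes) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(imageBytes);
    }

    public static void main(String[] args) {
        // Placeholder bytes only; real code would pass the bytes of the image part.
        byte[] imageBytes = {(byte) 0x89, 'P', 'N', 'G'};
        System.out.println("<img src=\"" + toDataUri("image/png", imageBytes) + "\"/>");
    }
}
```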

How to convert HTML To PDF and open pdf file, using java with YaHP Html to Pdf Convertor

I am using the YaHP converter to convert an HTML file to PDF. Here is the code I use for the conversion. It works fine, but I want to open the PDF file after the conversion.
Any ideas, please?
CYaHPConverter converter = new CYaHPConverter();
FileOutputStream out = new FileOutputStream(pdfOut);
Map properties = new HashMap();
List headerFooterList = new ArrayList();
Thanks in advance
I think this helps:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;
import org.allcolor.yahp.converter.CYaHPConverter;
import org.allcolor.yahp.converter.IHtmlToPdfTransformer;
public class HtmlToPdf_yahp_2 {
public static void main(String ... args ) throws Exception {
String root = "c:/temp/html";
String input = "file_1659686.htm"; // need to be charset utf-8
htmlToPdfFile(new File(root, input),
new File(root, input + ".pdf"));
public static void htmlToPdfFile(File htmlIn, File pdfOut) throws Exception {
// "\\Z" makes the scanner return the whole file as one token
Scanner scanner =
new Scanner(htmlIn).useDelimiter("\\Z");
String htmlContents = scanner.next();
CYaHPConverter converter = new CYaHPConverter();
FileOutputStream out = new FileOutputStream(pdfOut);
Map properties = new HashMap();
List headerFooterList = new ArrayList();
//properties.put(IHtmlToPdfTransformer.FOP_TTF_FONT_PATH, fontPath);
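For the part of the question about opening the PDF afterwards, one option that does not depend on YaHP is java.awt.Desktop, which hands the file to the default viewer of the operating system. This is a minimal sketch with made-up class and method names. It assumes the conversion has finished and the output stream is closed before it is called.

```java
import java.awt.Desktop;
import java.awt.GraphicsEnvironment;
import java.io.File;
import java.io.IOException;

class OpenPdfSketch {

    /** Tries to open the file in the default viewer; returns false if that is not possible. */
    static boolean openWithDefaultViewer(File pdf) {
        if (!pdf.isFile()) {
            return false; // nothing to open
        }
        if (GraphicsEnvironment.isHeadless() || !Desktop.isDesktopSupported()) {
            return false; // e.g. running on a server without a desktop
        }
        Desktop desktop = Desktop.getDesktop();
        if (!desktop.isSupported(Desktop.Action.OPEN)) {
            return false;
        }
        try {
            desktop.open(pdf);
            return true;
        } catch (IOException e) {
            return false; // no application is associated with the file type
        }
    }

    public static void main(String[] args) {
        File pdf = new File(args.length > 0 ? args[0] : "out.pdf");
        System.out.println("opened: " + openWithDefaultViewer(pdf));
    }
}
```

In the code above, the call would go at the end of htmlToPdfFile with pdfOut as the argument.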
See this for further info:

Java: Apache POI: Can I get clean text from MS Word (.doc) files?

The strings I'm (programmatically) getting from MS Word files when using Apache POI are not the same text I can look at when I open the files with MS Word.
When using the following code:
File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
HWPFDocument wordDoc = new HWPFDocument(inputStrm);
the output is a single line with many 'invalid' characters (yes, the 'boxes'), and many unwanted strings, like "FORMTEXT", "HYPERLINK \l "_Toc##########"" ('#' being numeric digits), "PAGEREF _Toc########## \h 4", etc.
The following code "fixes" the single-line problem, but maintains all the invalid characters and unwanted text:
File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
WordExtractor wordExtractor = new WordExtractor(inputStrm);
for(String paragraph:wordExtractor.getParagraphText()){
I don't know if I'm using the wrong method for extracting the text, but that's what I've come up with when looking at POI's quick-guide. If I am, what is the correct approach?
If that output is correct, is there a standard way for getting rid of the unwanted text, or will I have to write a filter of my own?
This class can read both .doc and .docx files in Java. For this I'm using tika-app-1.2.jar:
/**
 * This class is used to read .doc and .docx files
 * @author Developer
 */
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

class TextExtractor {
    private OutputStream outputstream;
    private ParseContext context;
    private Detector detector;
    private Parser parser;
    private Metadata metadata;
    private String extractedText;

    public TextExtractor() {
        context = new ParseContext();
        detector = new DefaultDetector();
        parser = new AutoDetectParser(detector);
        context.set(Parser.class, parser);
        outputstream = new ByteArrayOutputStream();
        metadata = new Metadata();
    }

    public void process(String filename) throws Exception {
        // accept either a local file name or a URL
        URL url;
        File file = new File(filename);
        if (file.isFile()) {
            url = file.toURI().toURL();
        } else {
            url = new URL(filename);
        }
        InputStream input = TikaInputStream.get(url, metadata);
        ContentHandler handler = new BodyContentHandler(outputstream);
        parser.parse(input, handler, metadata, context);
        input.close();
    }

    public void getString() {
        //Get the text into a String object
        extractedText = outputstream.toString();
        //Do whatever you want with this String object.
        System.out.println(extractedText);
    }

    public static void main(String args[]) throws Exception {
        if (args.length == 1) {
            TextExtractor textExtractor = new TextExtractor();
            textExtractor.process(args[0]);
            textExtractor.getString();
        } else {
            throw new Exception();
        }
    }
}
To compile:
javac -cp ".:tika-app-1.2.jar" TextExtractor.java
To run:
java -cp ".:tika-app-1.2.jar" TextExtractor SomeWordDocument.doc
There are two options, one provided directly in Apache POI, the other via Apache Tika (which uses Apache POI internally).
The first option is to use WordExtractor, but pass each paragraph through stripFields(String). That removes the text-based fields included in the text, things like the HYPERLINK you've seen. Your code would become:
NPOIFSFileSystem fs = new NPOIFSFileSystem(file);
WordExtractor extractor = new WordExtractor(fs.getRoot());
for (String rawText : extractor.getParagraphText()) {
    // stripFields is a static helper on WordExtractor
    String text = WordExtractor.stripFields(rawText);
    System.out.println(text);
}
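If you do end up writing a filter of your own, as the question asks, the idea can be sketched with the JDK alone. This is not POI's implementation. It assumes, as I understand the binary Word format, that a field is stored as a begin mark (0x13), the field instruction, an optional separator (0x14), the visible result, and an end mark (0x15). The class and method names are made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class FieldFilterSketch {

    // Assumed control characters that delimit fields in text extracted from .doc files.
    private static final char FIELD_BEGIN = 0x13;
    private static final char FIELD_SEPARATOR = 0x14;
    private static final char FIELD_END = 0x15;

    /** Drops field instructions such as HYPERLINK or PAGEREF and keeps the visible field results. */
    static String stripFieldCodes(String raw) {
        StringBuilder out = new StringBuilder();
        // One entry per open field: TRUE while we are still inside its instruction part.
        Deque<Boolean> inInstruction = new ArrayDeque<>();
        for (char c : raw.toCharArray()) {
            if (c == FIELD_BEGIN) {
                inInstruction.push(Boolean.TRUE);
            } else if (c == FIELD_SEPARATOR) {
                if (!inInstruction.isEmpty()) {
                    inInstruction.pop();
                    inInstruction.push(Boolean.FALSE); // now inside the visible result
                }
            } else if (c == FIELD_END) {
                if (!inInstruction.isEmpty()) {
                    inInstruction.pop();
                }
            } else if (!inInstruction.contains(Boolean.TRUE)) {
                out.append(c); // plain text or a field result
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "See \u0013 HYPERLINK \\l \"_Toc1\" \u0014Chapter 1\u0015 now";
        System.out.println(stripFieldCodes(raw));
    }
}
```

The remaining "boxes" are other control characters. Whether to drop or replace them depends on what your text needs.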
The other option is to use Apache Tika. Tika provides text extraction, and metadata, for a wide variety of files, so the same code will work for .doc, .docx, .pdf and many others too. To get clean, plain text of your word document (you can also get XHTML if you'd rather), you'd do something like:
TikaConfig tika = TikaConfig.getDefaultConfig();
TikaInputStream stream = TikaInputStream.get(file);
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
// parse the stream opened above; the handler collects the plain text
tika.getParser().parse(stream, handler, metadata, new ParseContext());
String text = handler.toString();
Try this. It works for me and is purely a POI solution. You will have to look for the HWPFDocument counterpart, though: HWPFDocument is for binary .doc files (Word 97-2003), while XWPFDocument, which I use here, is for .docx files.
InputStream inputstream = new FileInputStream(m_filepath);
// read the file into the XWPF (.docx) model
XWPFDocument adoc = new XWPFDocument(inputstream);
// get the full text
String aString = new XWPFWordExtractor(adoc).getText();
Now, if you only want certain parts, don't use the text extractor. Read the text directly from the paragraphs, like this:
for (XWPFParagraph p : adoc.getParagraphs()) {
    System.out.println(p.getText());
}
