Convert DOCX to HTML incliding IMAGES

Convert DOCX to HTML incliding IMAGES - java

I am using DOCX4J to convert the DOCX to HTML .I have successfully done the conversion and got the html format.I will be using the html format to embed it as EMAIL body to send an email.But I have some issues which are listed below....
Unable to display images in email body
Losing the spaces and bullets
Please find the code which I have written,
WordprocessingMLPackage wordMLPackage;
wordMLPackage = Docx4J.load(new java.io.File(resourcePath2));
HTMLSettings htmlSettings = Docx4J.createHTMLSettings();
htmlSettings.setImageDirPath(imageFolder + resourcePath2 + "_files");
htmlSettings.setImageTargetUri(imageFolder +resourcePath2.substring(resourcePath2.lastIndexOf("/")+1) + "_files");
htmlSettings.setWmlPackage(wordMLPackage);
OutputStream os;
os = new ByteArrayOutputStream();
Docx4jProperties.setProperty("docx4j.Convert.Out.HTML.OutputMethodXML", true);
Docx4J.toHTML(htmlSettings, os, Docx4J.FLAG_SAVE_FLAT_XML);
DOCX = ((ByteArrayOutputStream)os).toString();

You may add like this in your code
package tcg.doc.web.managedBeans;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.poi.xwpf.converter.core.FileImageExtractor;
import org.apache.poi.xwpf.converter.core.FileURIResolver;
import org.apache.poi.xwpf.converter.xhtml.XHTMLConverter;
import org.apache.poi.xwpf.converter.xhtml.XHTMLOptions;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.context.annotation.Scope;
import org.springframework.stereotype.Component;
#Component
#Scope("session")
#Qualifier("ConvertWord")
public class ConvertWord {
private static final String docName = "TestDocx.docx";
private static final String outputlFolderPath = "d:/";
String htmlNamePath = "docHtml.html";
String zipName="_tmp.zip";
File docFile = new File(outputlFolderPath+docName);
File zipFile = new File(zipName);
public void ConvertWordToHtml() {
try {
// 1) Load DOCX into XWPFDocument
InputStream doc = new FileInputStream(new File(outputlFolderPath+docName));
System.out.println("InputStream"+doc);
XWPFDocument document = new XWPFDocument(doc);
// 2) Prepare XHTML options (here we set the IURIResolver to load images from a "word/media" folder)
XHTMLOptions options = XHTMLOptions.create(); //.URIResolver(new FileURIResolver(new File("word/media")));;
// Extract image
String root = "target";
File imageFolder = new File( root + "/images/" + doc );
options.setExtractor( new FileImageExtractor( imageFolder ) );
// URI resolver
options.URIResolver( new FileURIResolver( imageFolder ) );
OutputStream out = new FileOutputStream(new File(htmlPath()));
XHTMLConverter.getInstance().convert(document, out, options);
System.out.println("OutputStream "+out.toString());
} catch (FileNotFoundException ex) {
} catch (IOException ex) {
}
}
public static void main(String[] args) {
ConvertWord cwoWord=new ConvertWord();
cwoWord.ConvertWordToHtml();
System.out.println();
}
public String htmlPath(){
// d:/docHtml.html
return outputlFolderPath+htmlNamePath;
}
public String zipPath(){
// d:/_tmp.zip
return outputlFolderPath+zipName;
}
}
For maven Dependency on pom.xml
<dependency>
<groupId>fr.opensagres.xdocreport</groupId>
<artifactId>org.apache.poi.xwpf.converter.xhtml</artifactId>
<version>1.0.4</version>
</dependency>
or download it from Here

For images to work in an email body, I guess you need to use either a data URI or publish them to a web-reachable location.
In either case, you'll need to write an implementation of:
public interface ConversionImageHandler {
/**
* #param picture
* #param relationship of the image
* #param part of the image, if it is an internal image, otherwise null
* #return uri for the image we've saved, or null
* #throws Docx4JException this exception will be logged, but not propagated
*/
public String handleImage(AbstractWordXmlPicture picture, Relationship relationship, BinaryPart part) throws Docx4JException;
}
and configure docx4j to use it with htmlSettings.setImageHandler.
You can look at some of the existing implementations in the docx4j source code, and take advantage of the helper methods in AbstractConversionImageHandler (eg createEncodedImage if you want data URIs).

Related

How to add an altChunk element to a XWPFDocument using Apache POI

I would like to add HTML as an altChunk to a DOCX file using Apache POI. I know that doc4jx can do this with a simpler API but for technical reasons I need to use Apache POI.
Using the CT classes to do low level stuff with the xml is a little tricky. I can create an altChunk with following code:
import java.io.File;
import java.io.FileOutputStream;
import javax.xml.namespace.QName;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
import org.apache.xmlbeans.impl.values.XmlComplexContentImpl;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTDocument1;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.impl.CTBodyImpl;
public class AltChunkTest {
public static void main(String[] args) throws Exception {
XWPFDocument doc = new XWPFDocument();
doc.createParagraph().createRun().setText("AltChunk below:");
QName ALTCHUNK = new QName ( "http://schemas.openxmlformats.org/wordprocessingml/2006/main" , "altChunk" ) ;
CTDocument1 ctDoc = doc.getDocument() ;
CTBodyImpl ctBody = (CTBodyImpl) ctDoc. getBody();
XmlComplexContentImpl xcci = ( XmlComplexContentImpl ) ctBody.get_store().add_element_user(ALTCHUNK);
// what's need to now add "<b>Hello World!</b>"
FileOutputStream out = new FileOutputStream(new File("test.docx"));
doc.write(out);
}
}
But how do I add the html content to 'xcci' it now?

In Office Open XML for Word (*.docx) the altChunk provides a method for using pure HTML to describe document parts.
Two important notes about altChunk:
First: It is used only for importing content. If you open the document using Word and save it, the newly saved document will not contain the alternative format content part, nor the altChunk markup that references it. Word saves all imported content as default Office Open XML elements.
Second: Most applications except Word which are able reading *.docx too will not reading the altChunk content at all. For example Libreoffice or OpenOffice Writer will not reading the altChunk content as well as apache poi will not reading the altChunk content when opening a *.docx file.
How is altChunk implemented in the *.docx ZIP file structure?
There are /word/*.html files in the *.docx ZIP file. Those are referenced by Id in /word/document.xml as <w:altChunk r:id="htmlDoc1"/> for example. The relation between the Ids and the /word/*.html files are given in /word/_rels/document.xml.rels as <Relationship Id="htmlDoc1" Target="htmlDoc1.html" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk"/> for example.
So we need at first POIXMLDocumentParts for the /word/*.html files and POIXMLRelations for the relation between the Ids and the /word/*.html files. Following code provides that by having a wrapper class which extends POIXMLDocumentPart for the /word/htmlDoc#.html files in the *.docx ZIP archive. This also provides methods for manipulating the HTML. Also it provides a method for creating the /word/htmlDoc#.html files in the *.docx ZIP archive and creating relations to it.
Code:
import java.io.*;
import org.apache.poi.*;
import org.apache.poi.ooxml.*;
import org.apache.poi.openxml4j.opc.*;
import org.apache.poi.xwpf.usermodel.*;
public class CreateWordWithHTMLaltChunk {
//a method for creating the htmlDoc /word/htmlDoc#.html in the *.docx ZIP archive
//String id will be htmlDoc#.
private static MyXWPFHtmlDocument createHtmlDoc(XWPFDocument document, String id) throws Exception {
OPCPackage oPCPackage = document.getPackage();
PackagePartName partName = PackagingURIHelper.createPartName("/word/" + id + ".html");
PackagePart part = oPCPackage.createPart(partName, "text/html");
MyXWPFHtmlDocument myXWPFHtmlDocument = new MyXWPFHtmlDocument(part, id);
document.addRelation(myXWPFHtmlDocument.getId(), new XWPFHtmlRelation(), myXWPFHtmlDocument);
return myXWPFHtmlDocument;
}
public static void main(String[] args) throws Exception {
XWPFDocument document = new XWPFDocument();
XWPFParagraph paragraph;
XWPFRun run;
MyXWPFHtmlDocument myXWPFHtmlDocument;
paragraph = document.createParagraph();
run = paragraph.createRun();
run.setText("Default paragraph followed by first HTML chunk.");
myXWPFHtmlDocument = createHtmlDoc(document, "htmlDoc1");
myXWPFHtmlDocument.setHtml(myXWPFHtmlDocument.getHtml().replace("<body></body>",
"<body><p>Simple <b>HTML</b> <i>formatted</i> <u>text</u></p></body>"));
document.getDocument().getBody().addNewAltChunk().setId(myXWPFHtmlDocument.getId());
paragraph = document.createParagraph();
run = paragraph.createRun();
run.setText("Default paragraph followed by second HTML chunk.");
myXWPFHtmlDocument = createHtmlDoc(document, "htmlDoc2");
myXWPFHtmlDocument.setHtml(myXWPFHtmlDocument.getHtml().replace("<body></body>",
"<body>" +
"<table>"+
"<caption>A table></caption>" +
"<tr><th>Name</th><th>Date</th><th>Amount</th></tr>" +
"<tr><td>John Doe</td><td>2018-12-01</td><td>1,234.56</td></tr>" +
"</table>" +
"</body>"
));
document.getDocument().getBody().addNewAltChunk().setId(myXWPFHtmlDocument.getId());
FileOutputStream out = new FileOutputStream("CreateWordWithHTMLaltChunk.docx");
document.write(out);
out.close();
document.close();
}
//a wrapper class for the htmlDoc /word/htmlDoc#.html in the *.docx ZIP archive
//provides methods for manipulating the HTML
//TODO: We should *not* using String methods for manipulating HTML!
private static class MyXWPFHtmlDocument extends POIXMLDocumentPart {
private String html;
private String id;
private MyXWPFHtmlDocument(PackagePart part, String id) throws Exception {
super(part);
this.html = "<!DOCTYPE html><html><head><meta http-equiv=\"content-type\" content=\"text/html; charset=utf-8\"><style></style><title>HTML import</title></head><body></body>";
this.id = id;
}
private String getId() {
return id;
}
private String getHtml() {
return html;
}
private void setHtml(String html) {
this.html = html;
}
#Override
protected void commit() throws IOException {
PackagePart part = getPackagePart();
OutputStream out = part.getOutputStream();
Writer writer = new OutputStreamWriter(out, "UTF-8");
writer.write(html);
writer.close();
out.close();
}
}
//the XWPFRelation for /word/htmlDoc#.html
private final static class XWPFHtmlRelation extends POIXMLRelation {
private XWPFHtmlRelation() {
super(
"text/html",
"http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk",
"/word/htmlDoc#.html");
}
}
}
Note: Because of using altChunk this code needs the full jar of all of the schemas ooxml-schemas-*.jar as mentioned in apache poi faq-N10025.
Result:

Based on Axel Richter's answer, I replaced the call to CTBody.addNewAltChunk() with CTBodyImpl.get_store().add_element_user(QName) which eliminates the added 15MB dependency on ooxml-schemas. Since this is being used in a desktop app, we are trying to keep the app size as small as possible. In case it may be of help to anyone else:
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import javax.xml.namespace.QName;
import org.apache.poi.ooxml.POIXMLDocumentPart;
import org.apache.poi.ooxml.POIXMLRelation;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackagePart;
import org.apache.poi.openxml4j.opc.PackagePartName;
import org.apache.poi.openxml4j.opc.PackagingURIHelper;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.xmlbeans.SimpleValue;
import org.apache.xmlbeans.impl.values.XmlComplexContentImpl;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.impl.CTBodyImpl;
public class AltChunkTest {
public static void main(String[] args) throws Exception {
XWPFDocument doc = new XWPFDocument();
doc.createParagraph().createRun().setText("AltChunk below:");
addHtml(doc,"chunk1","<!DOCTYPE html><html><head><style></style><title></title></head><body><b>Hello World!</b></body></html>");
FileOutputStream out = new FileOutputStream(new File("test.docx"));
doc.write(out);
}
static void addHtml(XWPFDocument doc, String id,String html) throws Exception {
OPCPackage oPCPackage = doc.getPackage();
PackagePartName partName = PackagingURIHelper.createPartName("/word/" + id + ".html");
PackagePart part = oPCPackage.createPart(partName, "text/html");
class HtmlRelation extends POIXMLRelation {
private HtmlRelation() {
super( "text/html",
"http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk",
"/word/htmlDoc#.html");
}
}
class HtmlDocumentPart extends POIXMLDocumentPart {
private HtmlDocumentPart(PackagePart part) throws Exception {
super(part);
}
#Override
protected void commit() throws IOException {
try (OutputStream out = part.getOutputStream()) {
try (Writer writer = new OutputStreamWriter(out, "UTF-8")) {
writer.write(html);
}
}
}
};
HtmlDocumentPart documentPart = new HtmlDocumentPart(part);
doc.addRelation(id, new HtmlRelation(), documentPart);
CTBodyImpl b = (CTBodyImpl) doc.getDocument().getBody();
QName ALTCHUNK = new QName("http://schemas.openxmlformats.org/wordprocessingml/2006/main", "altChunk");
XmlComplexContentImpl altchunk = (XmlComplexContentImpl) b.get_store().add_element_user(ALTCHUNK);
QName ID = new QName("http://schemas.openxmlformats.org/officeDocument/2006/relationships", "id");
SimpleValue target = (SimpleValue)altchunk.get_store().add_attribute_user(ID);
target.setStringValue(id);
}
}

This feature is ok in poi-ooxml 4.0.0, where the class POIXMLDocumentPart and POIXMLRelation are in the package org.apache.poi.ooxml.*
import org.apache.poi.ooxml.POIXMLDocumentPart;
import org.apache.poi.ooxml.POIXMLRelation;
But how we can use in poi-ooxml 3.9, where the class are little different and in the org.apache.poi.*
import org.apache.poi.POIXMLDocumentPart;
import org.apache.poi.POIXMLRelation;

Jsoup reddit scraper 429 error

So I'm trying to use jsoup to scrape Reddit for images, but when I scrape certain subreddits such as /r/wallpaper, I get a 429 error and am wondering how to fix this. Totally understand that this code is horrible and this is a pretty noob question, but I'm completely new to this. Anyways:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.io.*;
import java.net.URL;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.io.*;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attributes;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
import java.net.URL;
import java.util.Scanner;
public class javascraper{
public static void main (String[]args) throws MalformedURLException
{
Scanner scan = new Scanner (System.in);
System.out.println("Where do you want to store the files?");
String folderpath = scan.next();
System.out.println("What subreddit do you want to scrape?");
String subreddit = scan.next();
subreddit = ("http://reddit.com/r/" + subreddit);
new File(folderpath + "/" + subreddit).mkdir();
//test
try{
//gets http protocol
Document doc = Jsoup.connect(subreddit).timeout(0).get();
//get page title
String title = doc.title();
System.out.println("title : " + title);
//get all links
Elements links = doc.select("a[href]");
for(Element link : links){
//get value from href attribute
String checkLink = link.attr("href");
Elements images = doc.select("img[src~=(?i)\\.(png|jpe?g|gif)]");
if (imgCheck(checkLink)){ // checks to see if img link j
System.out.println("link : " + link.attr("href"));
downloadImages(checkLink, folderpath);
}
}
}
catch (IOException e){
e.printStackTrace();
}
}
public static boolean imgCheck(String http){
String png = ".png";
String jpg = ".jpg";
String jpeg = "jpeg"; // no period so checker will only check last four characaters
String gif = ".gif";
int length = http.length();
if (http.contains(png)|| http.contains("gfycat") || http.contains(jpg)|| http.contains(jpeg) || http.contains(gif)){
return true;
}
else{
return false;
}
}
private static void downloadImages(String src, String folderpath) throws IOException{
String folder = null;
//Exctract the name of the image from the src attribute
int indexname = src.lastIndexOf("/");
if (indexname == src.length()) {
src = src.substring(1, indexname);
}
indexname = src.lastIndexOf("/");
String name = src.substring(indexname, src.length());
System.out.println(name);
//Open a URL Stream
URL url = new URL(src);
InputStream in = url.openStream();
OutputStream out = new BufferedOutputStream(new FileOutputStream( folderpath+ name));
for (int b; (b = in.read()) != -1;) {
out.write(b);
}
out.close();
in.close();
}
}

Your issue is caused by the fact that your scraper is violating reddit's API rules. Error 429 means "Too many requests" – you're requesting too many pages too fast.
You can make one request every 2 seconds, and you also need to set a proper user agent (they format they recommend is <platform>:<app ID>:<version string> (by /u/<reddit username>)). The way it currently looks, your code is running too fast and doesn't specify one, so it's going to be severely rate-limited.
To fix it, first off, add this to the start of your class, before the main method:
public static final String USER_AGENT = "<PUT YOUR USER AGENT HERE>";
(Make sure to specify an actual user agent).
Then, change this (in downloadImages)
URL url = new URL(src);
InputStream in = url.openStream();
to this:
URLConnection connection = (new URL(src)).openConnection();
Thread.sleep(2000); //Delay to comply with rate limiting
connection.setRequestProperty("User-Agent", USER_AGENT);
InputStream in = connection.getInputStream();
You'll also want to change this (in main)
Document doc = Jsoup.connect(subreddit).timeout(0).get();
to this:
Document doc = Jsoup.connect(subreddit).userAgent(USER_AGENT).timeout(0).get();
Then your code should stop running into that error.
Note that using reddit's API (IE, /r/subreddit.json instead of /r/subreddit) would probably make this project easier, but it isn't required and your current code will work.

As you can look up at Wikipedia the 429 status code tells you that you have too many requests:
The user has sent too many requests in a given amount of time. Intended for use with rate limiting schemes.
A solution would be to slow down your scraper. There are some options how to do this, one would be to use sleep.

Map a image file through Spring controller

Is there any way to map a image file using a spring controller? In my spring application, I want store the images in the directory src/main/resources (i'm using maven) and access them with a method like this:
#RequestMapping(value="image/{theString}")
public ModelAndView image(#PathVariable String theString) {
return new ModelAndView('what should be placed here?');
}
the string theString it's the image name (without extension). With this approach, I should be able to access my images this way:
/webapp/controller_mapping/image/image_name
Anyone can point a direction to do that?

You can return HttpEntity<byte[]>. Construct new instance providing image byte array and necessary headers like content length and mime type then return it from your method. Image bytes can be obtained using classloader getResourceAsStream method.

This works for me. It could use some cleaning up but it works. The ServiceException is just a simple base exception.
Good Luck!
package com.dhargis.example;
import java.io.File;
import java.io.IOException;
import javax.servlet.ServletOutputStream;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.commons.io.FileUtils;
import org.apache.log4j.Logger;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
#Controller
#RequestMapping("/image")
public class ImageController {
private static final Logger log = Logger.getLogger(ImageController.class);
private String filestore = "C:\\Users\\dhargis";
//produces = "application/octet-stream"
#RequestMapping(value = "/{filename:.+}", method = RequestMethod.GET)
public void get( #PathVariable String filename,
HttpServletRequest request,
HttpServletResponse response) {
log.info("Getting file " + filename);
try {
byte[] content = null;
File store = new File(filestore);
if( store.exists() ){
File file = new File(store.getPath()+File.separator+filename);
if( file.exists() ){
content = FileUtils.readFileToByteArray(file);
} else {
throw new ServiceException("File does not exist");
}
} else {
throw new ServiceException("Report store is required");
}
ServletOutputStream out = response.getOutputStream();
out.write(content);
out.flush();
out.close();
} catch (ServiceException e) {
log.error("Error on get", e);
} catch (IOException e) {
log.error("Error on get", e);
}
}
}
<!-- begin snippet: js hide: false -->

How to convert HTML To PDF and open pdf file, using java with YaHP Html to Pdf Convertor

Am using a YaHP-Converter to convert HTML File to Pdf. Here is the code example i have used for converting. The code works me fine. But i want open Pdf file after this conversion.
Any idea please.
CYaHPConverter converter = new CYaHPConverter();
FileOutputStream out = new FileOutputStream(pdfOut);
Map properties = new HashMap();
List headerFooterList = new ArrayList();
properties.put(IHtmlToPdfTransformer.PDF_RENDERER_CLASS,IHtmlToPdfTransformer.FLYINGSAUCER_PDF_RENDERER);
converter.convertToPdf(htmlContents,
IHtmlToPdfTransformer.LEGALL,
headerFooterList,
"file:///D:/temp/",
out,
properties);
Thanks in advance

I think this helps:
import java.io.File;
import java.io.FileOutputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;
// http://www.allcolor.org/YaHPConverter/
import org.allcolor.yahp.converter.CYaHPConverter;
import org.allcolor.yahp.converter.IHtmlToPdfTransformer;
public class HtmlToPdf_yahp_2 {
public static void main(String ... args ) throws Exception {
String root = "c:/temp/html";
String input = "file_1659686.htm"; // need to be charset utf-8
htmlToPdfFile(new File(root, input),
new File(root, input + ".pdf"));
System.out.println("Done");
}
public static void htmlToPdfFile(File htmlIn, File pdfOut) throws Exception {
Scanner scanner =
new Scanner(htmlIn).useDelimiter("\\Z");
String htmlContents = scanner.next();
CYaHPConverter converter = new CYaHPConverter();
FileOutputStream out = new FileOutputStream(pdfOut);
Map properties = new HashMap();
List headerFooterList = new ArrayList();
properties.put(IHtmlToPdfTransformer.PDF_RENDERER_CLASS,
IHtmlToPdfTransformer.FLYINGSAUCER_PDF_RENDERER);
//properties.put(IHtmlToPdfTransformer.FOP_TTF_FONT_PATH, fontPath);
converter.convertToPdf(htmlContents,
IHtmlToPdfTransformer.A4P,
headerFooterList,
"file:///temp/html/",
out,
properties);
out.flush();
out.close();
}
}
See this for futher info:
http://www.rgagnon.com/javadetails/java-convert-html-to-pdf-using-yahp.html

Java: Apache POI: Can I get clean text from MS Word (.doc) files?

The strings I'm (programmatically) getting from MS Word files when using Apache POI are not the same text I can look at when I open the files with MS Word.
When using the following code:
File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
HWPFDocument wordDoc = new HWPFDocument(inputStrm);
System.out.println(wordDoc.getText());
the output is a single line with many 'invalid' characters (yes, the 'boxes'), and many unwanted strings, like "FORMTEXT", "HYPERLINK \l "_Toc##########"" ('#' being numeric digits), "PAGEREF _Toc########## \h 4", etc.
The following code "fixes" the single-line problem, but maintains all the invalid characters and unwanted text:
File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
WordExtractor wordExtractor = new WordExtractor(inputStrm);
for(String paragraph:wordExtractor.getParagraphText()){
System.out.println(paragraph);
}
I don't know if I'm using the wrong method for extracting the text, but that's what I've come up with when looking at POI's quick-guide. If I am, what is the correct approach?
If that output is correct, is there a standard way for getting rid of the unwanted text, or will I have to write a filter of my own?

This class can read both .doc and .docx files in Java. For this I'm using tika-app-1.2.jar:
/*
* This class is used to read .doc and .docx files
*
* #author Developer
*
*/
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.net.URL;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
class TextExtractor {
private OutputStream outputstream;
private ParseContext context;
private Detector detector;
private Parser parser;
private Metadata metadata;
private String extractedText;
public TextExtractor() {
context = new ParseContext();
detector = new DefaultDetector();
parser = new AutoDetectParser(detector);
context.set(Parser.class, parser);
outputstream = new ByteArrayOutputStream();
metadata = new Metadata();
}
public void process(String filename) throws Exception {
URL url;
File file = new File(filename);
if (file.isFile()) {
url = file.toURI().toURL();
} else {
url = new URL(filename);
}
InputStream input = TikaInputStream.get(url, metadata);
ContentHandler handler = new BodyContentHandler(outputstream);
parser.parse(input, handler, metadata, context);
input.close();
}
public void getString() {
//Get the text into a String object
extractedText = outputstream.toString();
//Do whatever you want with this String object.
System.out.println(extractedText);
}
public static void main(String args[]) throws Exception {
if (args.length == 1) {
TextExtractor textExtractor = new TextExtractor();
textExtractor.process(args[0]);
textExtractor.getString();
} else {
throw new Exception();
}
}
}
To compile:
javac -cp ".:tika-app-1.2.jar" TextExtractor.java
To run:
java -cp ".:tika-app-1.2.jar" TextExtractor SomeWordDocument.doc

There are two options, one provided directly in Apache POI, the other via Apache Tika (which uses Apache POI internally).
The first option is to use WordExtractor, but wrap it in a call to stripFields(String) when calling it. That will remove the text based fields included in the text, things like HYPERLINK that you've seen. Your code would become:
NPOIFSFileSystem fs = new NPOIFSFileSytem(file);
WordExtractor extractor = new WordExtractor(fs.getRoot());
for(String rawText : extractor.getParagraphText()) {
String text = extractor.stripFields(rawText);
System.out.println(text);
}
The other option is to use Apache Tika. Tika provides text extraction, and metadata, for a wide variety of files, so the same code will work for .doc, .docx, .pdf and many others too. To get clean, plain text of your word document (you can also get XHTML if you'd rather), you'd do something like:
TikaConfig tika = TikaConfig.getDefaultConfig();
TikaInputStream stream = TikaInputStream.get(file);
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
tika.getParser().parse(input, handler, metadata, new ParseContext());
String text = handler.toString();

Try this, works for me and is purely a POI solution. You will have to look for the HWPFDocument counterpart though. Make sure the document you are reading predates Word 97, else use XWPFDocument like I do.
InputStream inputstream = new FileInputStream(m_filepath);
//read the file
XWPFDocument adoc= new XWPFDocument(inputstream);
//and place it in a xwpf format
aString = new XWPFWordExtractor(adoc).getText();
//gets the full text
Now if you want certain parts you can use the getparagraphtext but dont use the text extractor, use it directly on the paragraph like this
for (XWPFParagraph p : adoc.getParagraphs())
{
System.out.println(p.getParagraphText());
}

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Convert DOCX to HTML incliding IMAGES - java

Related

How to add an altChunk element to a XWPFDocument using Apache POI

Jsoup reddit scraper 429 error

Map a image file through Spring controller

How to convert HTML To PDF and open pdf file, using java with YaHP Html to Pdf Convertor

Java: Apache POI: Can I get clean text from MS Word (.doc) files?

Categories

Resources