Open Microsoft Word in Java

I'm trying to open an MS Word 2003 document in Java, search for a specified String, and replace it with a new String. I use Apache POI to do that. My code looks like the following:
public void searchAndReplace(String inputFilename, String outputFilename,
        HashMap<String, String> replacements) {
    File outputFile = null;
    File inputFile = null;
    FileInputStream fileIStream = null;
    FileOutputStream fileOStream = null;
    BufferedInputStream bufIStream = null;
    BufferedOutputStream bufOStream = null;
    POIFSFileSystem fileSystem = null;
    HWPFDocument document = null;
    Range docRange = null;
    Paragraph paragraph = null;
    CharacterRun charRun = null;
    Set<String> keySet = null;
    Iterator<String> keySetIterator = null;
    int numParagraphs = 0;
    int numCharRuns = 0;
    String text = null;
    String key = null;
    String value = null;
    try {
        // Create an instance of the POIFSFileSystem class and
        // attach it to the Word document using an InputStream.
        inputFile = new File(inputFilename);
        fileIStream = new FileInputStream(inputFile);
        bufIStream = new BufferedInputStream(fileIStream);
        fileSystem = new POIFSFileSystem(bufIStream);
        document = new HWPFDocument(fileSystem);
        docRange = document.getRange();
        numParagraphs = docRange.numParagraphs();
        keySet = replacements.keySet();
        for (int i = 0; i < numParagraphs; i++) {
            paragraph = docRange.getParagraph(i);
            text = paragraph.text();
            numCharRuns = paragraph.numCharacterRuns();
            for (int j = 0; j < numCharRuns; j++) {
                charRun = paragraph.getCharacterRun(j);
                text = charRun.text();
                System.out.println("Character Run text: " + text);
                keySetIterator = keySet.iterator();
                while (keySetIterator.hasNext()) {
                    key = keySetIterator.next();
                    if (text.contains(key)) {
                        value = replacements.get(key);
                        charRun.replaceText(key, value);
                        docRange = document.getRange();
                        paragraph = docRange.getParagraph(i);
                        charRun = paragraph.getCharacterRun(j);
                        text = charRun.text();
                    }
                }
            }
        }
        bufIStream.close();
        bufIStream = null;
        outputFile = new File(outputFilename);
        fileOStream = new FileOutputStream(outputFile);
        bufOStream = new BufferedOutputStream(fileOStream);
        document.write(bufOStream);
    } catch (Exception ex) {
        System.out.println("Caught an: " + ex.getClass().getName());
        System.out.println("Message: " + ex.getMessage());
        System.out.println("Stacktrace follows.............");
        ex.printStackTrace(System.out);
    }
}
I call this function with the following arguments:
HashMap<String, String> replacements = new HashMap<String, String>();
replacements.put("AAA", "BBB");
searchAndReplace("C:/Test.doc", "C:/Test1.doc", replacements);
When the Test.doc file contains a simple line like "AAA EEE", it works successfully. But when I use a more complicated file, it reads the content and generates the Test1.doc file, yet when I try to open that file, Word gives me the following error:
Word unable to read this document. It may be corrupt.
Try one or more of the following:
* Open and repair the file.
* Open the file with Text Recovery converter.
(C:\Test1.doc)
Please tell me what to do, because I'm a beginner in POI and I have not found a good tutorial for it.

First of all, you should be closing your document.
Besides that, what I suggest doing is resaving your original Word document as a Word XML document, then changing the extension manually from .xml to .doc. Then look at the XML of the actual document you're working with and trace the content to make sure you're not accidentally editing hexadecimal values (AAA and EEE could be hex values in other fields).
Without seeing the actual Word document it's hard to say what's going on.
There is not much documentation about POI at all, especially for Word documents, unfortunately.
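As a side note, the posted code never closes bufOStream after document.write(...), so the buffered bytes may never be flushed to disk, and that alone can leave a truncated, "corrupt" .doc file. A minimal sketch of the write step using try-with-resources (Java 7+; the output path is just a placeholder and document is the HWPFDocument from the code above):
try (FileOutputStream fileOStream = new FileOutputStream("C:/Test1.doc");
        BufferedOutputStream bufOStream = new BufferedOutputStream(fileOStream)) {
    // closing the streams (done automatically here) flushes the buffered bytes
    document.write(bufOStream);
}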

I don't know if it's OK to answer my own question, but just to share the knowledge, I'll answer myself.
After searching the web, the final solution I found is:
The library called docx4j is very good for dealing with MS docx files. Although its documentation is not sufficient yet and its forum is still in its early stages, overall it helped me do what I needed.
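For anyone landing here later, a rough sketch of the docx4j route; it assumes the placeholders in the .docx are written as ${AAA}-style variables (which is what docx4j's variable replacement works on), and the paths and keys are just examples:
import java.io.File;
import java.util.HashMap;
import org.docx4j.model.datastorage.migration.VariablePrepare;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;

public class Docx4jReplace {
    public static void main(String[] args) throws Exception {
        WordprocessingMLPackage pkg = WordprocessingMLPackage.load(new File("C:/Test.docx"));
        VariablePrepare.prepare(pkg); // joins split runs so ${...} variables can be found
        HashMap<String, String> replacements = new HashMap<String, String>();
        replacements.put("AAA", "BBB"); // replaces ${AAA} in the document
        pkg.getMainDocumentPart().variableReplace(replacements);
        pkg.save(new File("C:/Test1.docx"));
    }
}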
Thanks to all who helped me.

You could try the OpenOffice API, but there aren't many resources out there to tell you how to use it.

You can also try this one: http://www.dancrintea.ro/doc-to-pdf/

Looks like this could be the issue.

Related

read docx document using java

I have a steganography project that hides a docx document inside a JPEG image. Using Apache POI, I can run it and read the docx document, but only the text is read, even though there are pictures in it.
Here is the code:
FileInputStream in = null;
try
{
    in = new FileInputStream(directory);
    XWPFDocument datax = new XWPFDocument(in);
    XWPFWordExtractor extract = new XWPFWordExtractor(datax);
    String DataFinal = extract.getText();
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    String line = null;
    this.isi_file = extract.getText();
}
catch (IOException x) {}
System.out.println("isi :" + this.isi_file);
How can I read all the components in the docx document using Java? Please help me, and thank you for your help.
Please check the documentation for the XWPFDocument class. It contains some useful methods, for example:
getAllPictures() returns a list of all pictures in the document;
getTables() returns a list of all tables in the document (a small sketch for tables follows the picture example below).
Your code snippet already contains the line XWPFDocument datax = new XWPFDocument(in);, so after that line you can write some code like:
// process all pictures in the document
for (XWPFPictureData picture : datax.getAllPictures()) {
    // get each picture as a byte array
    byte[] pictureData = picture.getData();
    // process the picture somehow, e.g. report its name and size
    System.out.println(picture.getFileName() + ": " + pictureData.length + " bytes");
}
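Similarly, for the getTables() method mentioned above, a small sketch of walking the table structure:
// process all tables in the document
for (XWPFTable table : datax.getTables()) {
    for (XWPFTableRow row : table.getRows()) {
        for (XWPFTableCell cell : row.getTableCells()) {
            // print the plain text of each cell
            System.out.println(cell.getText());
        }
    }
}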

Converting part of .docx document to HTML using Apache POI

I use XHTMLConverter to convert .docx to HTML, to make a preview of the document. Is there any way to convert only a few pages from the original document? I'll be grateful for any help.
You have to parse the complete .docx file; it is not possible to read just parts of it. As for selecting a specific page number, I'm afraid (at least I believe) that Word does not store page numbers, so there is no function in the library to access a specified page.
(I've read this on another forum; it might actually be false information.)
PS: the Excel side of POI contains a .getSheetAt() method (this might help you in your research).
But there are also other ways to access your pages. For instance, you could read the lines of your docx document and search for the page numbers (this might break if your text contains those numbers, though). Another way would be to search for the header of the page, which would be more accurate:
HeaderStories headerStore = new HeaderStories(doc);
String header = headerStore.getHeader(pageNumber);
This should give you the header of the specified page. Same with the footer:
HeaderStories headerStore = new HeaderStories(doc);
String footer = headerStore.getFooter(pageNumber);
If this doesn't work, I am not really into that API...
Here is a little example of a very sloppy solution:
import java.io.*;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class ReadDocFile
{
    public static void main(String[] args)
    {
        File file = null;
        WordExtractor extractor = null;
        try
        {
            file = new File("c:\\New.doc");
            FileInputStream fis = new FileInputStream(file.getAbsolutePath());
            HWPFDocument document = new HWPFDocument(fis);
            extractor = new WordExtractor(document);
            String[] fileData = extractor.getParagraphText();
            for (int i = 0; i < fileData.length; i++)
            {
                if (fileData[i].equals("headerPageOne")) {
                    int firstLineOfPageOne = i;
                }
                if (fileData[i].equals("headerPageTwo")) {
                    int lastLineOfPageOne = i;
                }
            }
        }
        catch (Exception exep)
        {
            exep.printStackTrace();
        }
    }
}
If you go with this, I would recommend creating a list of your headers and refactoring the for-loop into a separate getPages() method. Your loop would then look like:
List<String> headerList = new ArrayList<String>(Arrays.asList("header1", "header2", "header3", "header4"));
for (int i = 0; i < fileData.length; i++)
{
    // well, there should be a loop for "x" too
    if (fileData[i].equals(headerList.get(x))) {
        int firstLineOfPageOne = i;
    }
    if (fileData[i].equals(headerList.get(x + 1))) {
        int lastLineOfPageOne = i;
    }
}
You could create an object (int pageStart, int pageStop), which would be the product of your method.
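For illustration only, here is a tiny sketch of such a result object and a getPages()-style helper; the names PageRange and getPages are made up for this example:
// hypothetical result object holding the first and last paragraph index of a page
class PageRange {
    final int pageStart;
    final int pageStop;

    PageRange(int pageStart, int pageStop) {
        this.pageStart = pageStart;
        this.pageStop = pageStop;
    }
}

// hypothetical helper: find the page delimited by two known header strings
static PageRange getPages(String[] fileData, String startHeader, String endHeader) {
    int start = -1;
    int stop = fileData.length - 1;
    for (int i = 0; i < fileData.length; i++) {
        if (fileData[i].equals(startHeader)) {
            start = i;
        } else if (fileData[i].equals(endHeader)) {
            stop = i;
            break;
        }
    }
    return new PageRange(start, stop);
}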
I hope it helped you :)

java.net.MalformedURLException

URL stringfile = getXsl("test.xml");
File originFile = new File(stringfile.getFile());
String xml = null;
ByteArrayOutputStream pdfStream = null;
try {
    FileInputStream fis = new FileInputStream(originFile);
    int length = fis.available();
    byte[] readData = new byte[length];
    fis.read(readData);
    xml = (new String(readData)).trim();
    fis.close();
    xml = xml.substring(xml.lastIndexOf("<HttpCommandList>") + 17, xml.lastIndexOf("</HttpCommandList>"));
    String[] splitxml = xml.split("</HttpCommand>");
    for (int i = 0; i < splitxml.length; i++) {
        tmpxml = splitxml[i].trim() + "</HttpCommand>";
        System.out.println("splitxml:" + tmpxml);
        pdfStream = new ByteArrayOutputStream();
        pdf = new com.lowagie.text.Document();
        PdfWriter.getInstance(pdf, pdfStream);
        pdf.open();
        URL xslToUse = getXsl("test.xsl");
        // Here I am using a utility class to transform and
        // generate the XML needed by iText to generate the PDF using MessageBuffer contents
        String iTextXml = XmlUtil.transformXml(tmpxml.toString(), xslToUse).trim();
        // generate the PDF document by parsing the specified XML file
        XmlParser.parse(pdf, new ByteArrayInputStream(iTextXml.getBytes()));
    }
For the above code, during the XmlParser.parse call I am getting a java.net.MalformedURLException: no protocol.
I am trying to generate the PDF document by parsing the specified XML file.
We would need the actual XML file to decide what is missing. I expect that there is no protocol defined, like this:
192.168.1.2/ (no protocol)
file://192.168.1.2/ (there is one)
And URL seems to need one.
Also try:
new File("somexsl.xlt").toURI().toURL();
See here and here.
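In other words, a URL built from a bare path or host has no protocol and fails, while one built from a File does; a minimal sketch (paths are placeholders):
// throws java.net.MalformedURLException: no protocol
URL bad = new URL("192.168.1.2/test.xsl");

// works: File.toURI().toURL() produces a file: URL with a protocol
URL good = new File("C:/test.xsl").toURI().toURL();
System.out.println(good); // e.g. file:/C:/test.xsl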
It always helps to post the complete stack trace. No one knows where the exception actually occurred if you don't post the line numbers.

Any difference in content extracted by pdfbox and itext

I am evaluating pdfbox-1.8.6 and iText-4.2.0 (free) for performance. I have noticed that content extraction is faster in iText, but searching words using regex in the content extracted by iText takes longer than with pdfbox.
The extracted content and its size seem to be the same in both cases.
Can anybody explain this to me?
My environment is ubuntu 12.04/java 1.7.
EDIT: adding sample codes.
pdfbox
//in constructor
fis = new FileInputStream(file);
BufferedInputStream bis = new BufferedInputStream(fis);
pdfDocument = PDDocument.load(bis);
textStripper = new PDFTextStripper();
//in extractContent method
textStripper.setStartPage(pageNo);
textStripper.setEndPage(pageNo);
return textStripper.getText(pdfDocument);
iText
//in constructor
fis = new FileInputStream(file);
reader = new PdfReader(fis);
pdfTextExtractor = new PdfTextExtractor(reader);
//in extractContent method
return pdfTextExtractor.getTextFromPage(pageNo);
searching words
StopWatch searchTime = ...
StopWatch pdfTime = ...
for (int i = 1; i <= pageCount; i++) {
    // fetch one page
    pdfTime.resume();
    String pageContent = pdfParserService.extractPageContent(i);
    pdfTime.suspend();
    if (pageContent == null) {
        pageContent = "";
    }
    pageContent = pageContent.replace("-\n", "").replace("\n", "").replace("\\", " ");
    searchTime.resume();
    Collection list = searchService.search(pageContent, wordList);
    searchTime.suspend();
}
pdfTime.stop();
searchTime.stop();
System.out.println(pdfTime.getTime());
System.out.println(searchTime.getTime());

How to use Apache HWPF to extract text and images out of a DOC file

I downloaded Apache HWPF. I want to use it to read a doc file and write its text into a plain text file. I don't know HWPF very well.
My very simple program is below.
I have 3 problems now:
Some of the packages have errors (they can't find Apache HWPF). How can I fix them?
How can I use the methods of HWPF to find and extract the images?
Some pieces of my program are incomplete and incorrect, so please help me to complete it.
I have to complete this program in 2 days.
Once again I repeat: please, please help me to complete this.
Thank you guys a lot for your help!!!
This is my elementary code:
public class test {
    public void m1() {
        String filesname = "Hello.doc";
        POIFSFileSystem fs = null;
        fs = new POIFSFileSystem(new FileInputStream(filesname));
        HWPFDocument doc = new HWPFDocument(fs);
        WordExtractor we = new WordExtractor(doc);
        String str = we.getText();
        String[] paragraphs = we.getParagraphText();
        Picture pic = new Picture(. . .);
        pic.writeImageContent( . . . );
        PicturesTable picTable = new PicturesTable( . . . );
        if (picTable.hasPicture( . . . )) {
            picTable.extractPicture(..., ...);
            picTable.getAllPictures();
        }
    }
}
Apache Tika will do this for you. It handles talking to POI to do the HWPF stuff, and presents you with either XHTML or Plain Text for the contents of the file. If you register a recursing parser, then you'll also get all the embedded images too.
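For example, here is a rough, self-contained sketch of the Tika route using the generic AutoDetectParser (the file name is just a placeholder):
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaDocText {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
        Metadata metadata = new Metadata();
        try (InputStream in = new FileInputStream("Hello.doc")) {
            parser.parse(in, handler, metadata, new ParseContext());
        }
        System.out.println(handler.toString()); // plain text of the .doc
    }
}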
// you can use org.apache.poi.hwpf.extractor.WordExtractor to get the text
String fileName = "example.doc";
HWPFDocument wordDoc = new HWPFDocument(new FileInputStream(fileName));
WordExtractor extractor = new WordExtractor(wordDoc);
String[] text = extractor.getParagraphText();
int lineCounter = text.length;
String articleStr = ""; // this string accumulates the text from the Word document
for (int index = 0; index < lineCounter; ++index) {
    String paragraphStr = text[index].replaceAll("\r\n", "").replaceAll("\n", "").trim();
    int paragraphLength = paragraphStr.length();
    if (paragraphLength != 0) {
        // concat() returns a new String, so the result has to be assigned back
        articleStr = articleStr.concat(paragraphStr);
    }
}
// you can use org.apache.poi.hwpf.usermodel.Picture to get the images
List<Picture> picturesList = wordDoc.getPicturesTable().getAllPictures();
for (int i = 0; i < picturesList.size(); ++i) {
    BufferedImage image = null;
    Picture pic = picturesList.get(i);
    image = ImageIO.read(new ByteArrayInputStream(pic.getContent()));
    if (image != null) {
        System.out.println("Image[" + i + "]" + " ImageWidth:" + image.getWidth()
                + " ImageHeight:" + image.getHeight()
                + " Suggested Image Format:" + pic.suggestFileExtension());
    }
}
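If you also want to write the extracted images out to files, which is what the question asks for, here is a small sketch building on the loop above (the output file names are made up; it simply dumps each picture's raw bytes with POI's suggested extension):
for (int i = 0; i < picturesList.size(); ++i) {
    Picture pic = picturesList.get(i);
    String outName = "image" + i + "." + pic.suggestFileExtension();
    try (FileOutputStream out = new FileOutputStream(outName)) {
        out.write(pic.getContent());
    }
}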
If you just want to do this, and you don't care about the coding, you can just use Antiword.
$ antiword file.doc > out.txt
I know this is long after the fact, but I've found TextMining on Google Code to be more accurate and very easy to use. It is, however, pretty much abandoned code.
