Java library for reading Word documents - java

Is there an open-source Java library for reading Word documents (both .docx and the older .doc format)?
Read-only access is sufficient; I do not need to modify the Word documents using Java. However, I would like to have access to images and style information.
EDIT
I've checked out Apache POI, but it doesn't look like it is being actively maintained. See http://poi.apache.org/hwpf/index.html:
At the moment we unfortunately do not have someone taking care for HWPF and fostering its development.

Apache POI HWPF for .doc and XWPF for .docx files

There is an Apache project that does this: http://poi.apache.org/

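import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.xmlbeans.XmlException;

public class XParseTest
{
    public static void main(String[] args) throws XmlException, OpenXML4JException, IOException
    {
        File file = new File("e:\\testing\\new.docx");
        // The input stream is closed automatically once the text has been printed
        try (FileInputStream fs = new FileInputStream(file)) {
            OPCPackage d = OPCPackage.open(fs);
            XWPFWordExtractor xw = new XWPFWordExtractor(d);
            System.out.println(xw.getText());
        }
    }
}
This will parse a .docx file and print its text.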

Related

Read excel file in java without using any jars

I have a requirement to read an Excel file in Java without using any third-party library JAR such as POI or JExcel. I don't know whether Spring offers support for this. Please suggest an approach if you know of one.
Thanks in advance.
I have already done this using POI, but I need to read the file without using any JAR:
public static void readFromExcel(String file) throws IOException {
    HSSFWorkbook myExcelBook = new HSSFWorkbook(new FileInputStream(file));
    HSSFSheet myExcelSheet = myExcelBook.getSheet("Birthdays");
    HSSFRow row = myExcelSheet.getRow(0);
    // CELL_TYPE_STRING is the constant from older POI releases; newer ones use the CellType enum
    if (row.getCell(0).getCellType() == HSSFCell.CELL_TYPE_STRING) {
        String name = row.getCell(0).getStringCellValue();
        System.out.println("name : " + name);
    }
}
If you are not allowed to use a 3rd party JAR or library you will need to write a parser to read the document into your data classes.
I would advise you to take a look at the file format specification for Microsoft Office. You will need to understand this to build a reliable parser.
It would be much easier to just use Apache POI and see if the requirements can be changed to allow it.
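To give a feel for what "writing a parser" involves, here is a minimal sketch that uses only JDK classes. It covers only the newer .xlsx format (a ZIP archive of XML parts), not the binary .xls format read by the HSSF code above. The class name TinyXlsxReader is hypothetical, and the sketch assumes the first sheet is stored under the conventional entry name xl/worksheets/sheet1.xml. It ignores styles, dates, inline strings and empty cells, so treat it as an illustration rather than a reliable reader.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical JDK-only reader for the first sheet of an .xlsx file. */
final class TinyXlsxReader {

    /** Returns each row as the list of its cell values (cells without a value are skipped). */
    static List<List<String>> readFirstSheet(InputStream xlsx)
            throws IOException, XMLStreamException {
        Map<String, byte[]> parts = unzip(xlsx);
        List<String> shared = readSharedStrings(parts.get("xl/sharedStrings.xml"));
        // Assumption: the first sheet is stored under its conventional name.
        byte[] sheet = parts.get("xl/worksheets/sheet1.xml");
        if (sheet == null) {
            throw new IOException("xl/worksheets/sheet1.xml not found");
        }
        List<List<String>> rows = new ArrayList<>();
        List<String> row = new ArrayList<>();
        String type = null;          // value of the current cell's "t" attribute
        StringBuilder value = null;  // non-null only while inside a <v> element
        XMLStreamReader r = open(sheet);
        while (r.hasNext()) {
            int ev = r.next();
            if (ev == XMLStreamConstants.START_ELEMENT) {
                String name = r.getLocalName();
                if (name.equals("row")) row = new ArrayList<>();
                else if (name.equals("c")) type = r.getAttributeValue(null, "t");
                else if (name.equals("v")) value = new StringBuilder();
            } else if (ev == XMLStreamConstants.CHARACTERS && value != null) {
                value.append(r.getText());
            } else if (ev == XMLStreamConstants.END_ELEMENT) {
                String name = r.getLocalName();
                if (name.equals("v")) {
                    String raw = value.toString();
                    // Cells of type "s" hold an index into the shared string table.
                    row.add("s".equals(type) ? shared.get(Integer.parseInt(raw)) : raw);
                    value = null;
                } else if (name.equals("row")) {
                    rows.add(row);
                }
            }
        }
        return rows;
    }

    /** Reads the shared string table; each <si> element is one string. */
    private static List<String> readSharedStrings(byte[] xml) throws XMLStreamException {
        List<String> strings = new ArrayList<>();
        if (xml == null) return strings;  // workbook without a shared string part
        StringBuilder current = new StringBuilder();
        boolean inText = false;
        XMLStreamReader r = open(xml);
        while (r.hasNext()) {
            int ev = r.next();
            if (ev == XMLStreamConstants.START_ELEMENT) {
                if (r.getLocalName().equals("si")) current.setLength(0);
                else if (r.getLocalName().equals("t")) inText = true;
            } else if (ev == XMLStreamConstants.CHARACTERS && inText) {
                current.append(r.getText());
            } else if (ev == XMLStreamConstants.END_ELEMENT) {
                if (r.getLocalName().equals("t")) inText = false;
                else if (r.getLocalName().equals("si")) strings.add(current.toString());
            }
        }
        return strings;
    }

    private static XMLStreamReader open(byte[] xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        return factory.createXMLStreamReader(new ByteArrayInputStream(xml));
    }

    /** Loads every ZIP entry into memory so parts can be read in any order. */
    private static Map<String, byte[]> unzip(InputStream in) throws IOException {
        Map<String, byte[]> parts = new HashMap<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            byte[] buf = new byte[8192];
            for (ZipEntry e; (e = zip.getNextEntry()) != null; ) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                for (int n; (n = zip.read(buf)) != -1; ) out.write(buf, 0, n);
                parts.put(e.getName(), out.toByteArray());
            }
        }
        return parts;
    }
}
```

Even this small sketch needs two XML parts and an index lookup just to recover cell text, which is a fair indication of why a library such as POI is the easier route when it is allowed.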

how to know whether a file is .docx or .doc format from Apache POI

I know this can be done by extension or by MIME type. Is there any other way to determine whether a file is .docx or .doc?
If it is just a matter of deciding the format of a collection of files known to be either .doc or .docx but not marked accordingly with an extension, you can use the fact that a .docx file is a zipped collection of files. Something along the following lines might help:
boolean isZip = new ZipInputStream( fileStream ).getNextEntry() != null;
where fileStream is whatever file or other input stream you wish to evaluate. You could further evaluate a zipped file by looking for key .docx entries. A good starting reference is Word Document (DOCX). Likewise, if you know it is just a binary file, you can test for Word's File Information Block (see Word (.doc) Binary File Format)
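The suggestion to look for key .docx entries can be sketched with java.util.zip alone. The class name DocxSniffer is hypothetical, and treating word/document.xml as the marker entry is an assumption based on the conventional layout of a .docx package; a package is technically free to name its main part differently.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper: guesses whether a stream holds a .docx package. */
final class DocxSniffer {

    // Conventional name of the main document part (an assumption, see above).
    private static final String MAIN_PART = "word/document.xml";

    /** Returns true if the stream is a ZIP archive that contains the main part. */
    static boolean looksLikeDocx(InputStream in) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            // getNextEntry() returns null immediately when the data is not a ZIP archive
            while ((entry = zip.getNextEntry()) != null) {
                if (MAIN_PART.equals(entry.getName())) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Note that the method consumes and closes the stream, so the file has to be reopened before it is handed to a real parser.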
You could use Apache Tika for content detection. But you should be aware that this is a huge framework (many required dependencies) for such a small task.
There is a way, though it is not straightforward. With Apache POI, you can work it out.
Try to read a .docx file using the HWPFDocument class. It will give you the following error:
org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied
data appears to be in the Office 2007+ XML. You are calling the part
of POI that deals with OLE2 Office Documents. You need to call a
different part of POI to process this data (eg XSSF instead of HSSF)
String filePath = "C:\\XXXX\\XXXX.docx";
FileInputStream inStream;
try {
    inStream = new FileInputStream(new File(filePath));
    // HWPFDocument only accepts the OLE2-based .doc format
    HWPFDocument doc = new HWPFDocument(inStream);
    WordExtractor wordExtractor = new WordExtractor(doc);
    System.out.println("Getting words" + wordExtractor.getText());
} catch (Exception e) {
    // Reached for a .docx file, but also for any other failure (e.g. a missing file)
    System.out.print("It's not a .doc format");
}
A .docx file can be read using the XWPFDocument class.
Why don't you use Apache Tika?
File file = new File("File Here");
Tika tika = new Tika();
String filetype = tika.detect(file);
System.out.println(filetype);
Assuming you're using Apache POI, you have a few options.
One is to grab the first few bytes of the file and ask POIFSFileSystem with the hasPOIFSHeader(byte[]) method. If you have a stream that supports mark/reset, you can instead use POIFSFileSystem.hasPOIFSHeader(InputStream). If those return true, try to open it as a .doc with HWPF; otherwise try as .docx with XWPF.
Otherwise, if you prefer a try/catch way, try to open it with POIFSFileSystem and catch OfficeXmlFileException: if it opens fine it's .doc, if you get the exception it's .docx.
If you look at the source code for WorkbookFactory you'll see the first pattern in use; you can copy a similar set of logic from that.
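For illustration, the "first few bytes" check can also be sketched without POI. This is not POI's implementation, and the class and method names are hypothetical. It relies on two well-known signatures: OLE2 compound files (used by .doc) start with the bytes D0 CF 11 E0 A1 B1 1A E1, and ZIP archives (used by .docx) normally start with "PK" followed by 03 04.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical helper: classifies a file by its leading signature bytes. */
final class WordFormatSniffer {

    enum Kind { OLE2, ZIP, UNKNOWN }

    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    /** Classifies the given leading bytes of a file. */
    static Kind sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) return Kind.OLE2;
        if (startsWith(header, ZIP_MAGIC)) return Kind.ZIP;
        return Kind.UNKNOWN;
    }

    /** Peeks at a mark/reset-capable stream and leaves its position unchanged. */
    static Kind sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        byte[] header = new byte[OLE2_MAGIC.length];
        in.mark(header.length);
        int total = 0;
        while (total < header.length) {
            int n = in.read(header, total, header.length - total);
            if (n < 0) break;  // stream is shorter than the signature
            total += n;
        }
        in.reset();
        return sniff(Arrays.copyOf(header, total));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

A result of OLE2 only says the file is some compound document (it could equally be an .xls or .ppt), and ZIP only says it might be an OOXML package, so the outcome is best used to choose which POI class to try first.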

While Reading the data from Excel file with extension xlsx using apache poi it takes long time

While reading an Excel file with the .xlsx extension using Apache POI, it takes a long time to identify the extension. Can you please help me understand why it takes so long?
if (file.getExcelFile().getOriginalFilename().endsWith("xls"))
{
workbook = new HSSFWorkbook(file.getExcelFile().getInputStream());
} else if (file.getExcelFile().getOriginalFilename().endsWith("xlsx"))
{
workbook = new XSSFWorkbook(file.getExcelFile().getInputStream());
} else {
throw new IllegalArgumentException("Received file does not have a standard excel extension.");
}
Promoting a comment to an answer - don't try to do this yourself, Apache POI has built-in code for doing this for you!
You should use WorkbookFactory.create(File) to do it, e.g. just
workbook = WorkbookFactory.create(file.getExcelFile());
As explained in the Apache POI docs, use a File directly in preference to an InputStream for quicker and lower-memory processing.

Error in read .doc and .docx file's content

I want to read .txt, .doc and .docx files and print their contents. When I run the code below, some .doc and .txt files are read, but many files cannot be read.
import java.io.File;
import javax.swing.*;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileReader;
public class FindYourDocx
{
public static void main(String[] args)
{
String text = "";
int read, N = 1024 * 1024;
char[] buffer = new char[N];
try {
JFileChooser openFile=new JFileChooser();
openFile.setCurrentDirectory(new File("."));
openFile.showOpenDialog(null);
File f1=openFile.getSelectedFile();
String file1=f1.toString();
File f =new File(file1);
JOptionPane.showMessageDialog(null,f);
FileReader fr = new FileReader(f);
BufferedReader br = new BufferedReader(fr);
while(true) {
read = br.read(buffer, 0, N);
text += new String(buffer, 0, read);
System.out.println("Follows"+text+" ");
if(read < N) {
break;
}
System.out.println("Follows"+text+" "); }
} catch(Exception ex) {
ex.printStackTrace();
}
}}
When I execute the above code, for some files I get weird output like the following:
http://i.stack.imgur.com/RwNWM.jpg
Could someone please help me solve this issue?
to read .docx i came across something like XWPFDocument using apacheio ....what is this ?
First of all, you should think about your problem: what do different file types look like as files, what is their structure, what content would you like to print, and what does "printing" mean at all? What you are doing is reading files, treating them as text and printing them to STDOUT. Does "printing" mean this in your case? I interpret "printing" as being able to send content to a printer and get some paper.
Another hint: .doc and .docx are binary files which contain "printable" text "somewhere". You can't just read the files and do something with the data. You need to know what those file formats look like, where the content is, etc. Java can't do that out of the box; you need additional libraries to parse those file formats and do something with them.
There are many tutorials and questions around formats like docx:
How to read docx file content in java api using poi jar
to read .docx i came across something like XWPFDocument using apacheio ....what is this ?
You mean Apache POI. To find out more, check the website. In brief, both Apache POI and docx4j (which I note you have tagged) are Java libraries aimed at reading, manipulating, and writing Microsoft Office files.
'doc' files are Microsoft proprietary binary files. If you try to read them in and display them using the Java IO API alone, all you will see is a representation of the binary data. It won't be useful to you. You need to use an API specifically for loading up and traversing Word files, which is where Apache POI or docx4j come in.
'docx' files are a newer XML-based Microsoft Office format. A docx file is essentially a zipped folder containing the various assets that make up a Word file.
As I said, in order to read a Word file properly, you will need to use one of the libraries mentioned. Both the Apache and docx4j websites contain plenty of example code to get you started opening and traversing Word documents (note that POI can work with the older .doc format, whereas docx4j is only for .docx files).
http://www.docx4java.org
http://poi.apache.org
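To make the "zipped folder" description of .docx concrete, here is a minimal sketch that pulls plain text out of a .docx using only JDK classes. The class name PlainDocxText is hypothetical. The sketch assumes the main part is stored under the conventional name word/document.xml and uses the common (transitional) WordprocessingML namespace. It ignores styles, images, tables, headers and footnotes, which is where POI or docx4j earn their keep.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical JDK-only text extractor for .docx files. */
final class PlainDocxText {

    // Namespace of the transitional WordprocessingML schema (assumed to be in use).
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the document text, one line per paragraph. */
    static String extract(InputStream docx) throws IOException, XMLStreamException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    return readParagraphs(zip);
                }
            }
        }
        throw new IOException("no word/document.xml entry found");
    }

    private static String readParagraphs(InputStream xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        XMLStreamReader reader = factory.createXMLStreamReader(xml);
        StringBuilder text = new StringBuilder();
        boolean inText = false;  // true while inside a <w:t> element
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                if (isWordElement(reader, "t")) inText = true;
            } else if (event == XMLStreamConstants.CHARACTERS) {
                if (inText) text.append(reader.getText());
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                if (isWordElement(reader, "t")) inText = false;
                else if (isWordElement(reader, "p")) text.append('\n');  // paragraph ends
            }
        }
        return text.toString();
    }

    private static boolean isWordElement(XMLStreamReader reader, String localName) {
        return W_NS.equals(reader.getNamespaceURI())
                && localName.equals(reader.getLocalName());
    }
}
```

The older binary .doc format has no comparably simple structure, which is why it effectively requires a dedicated library.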

Viewing .doc file with java applet

I have a web application. I've generated an MS Word document in XML format (Word 2003 XML Document) on the server side. I need to show this document to a user on the client side using some kind of viewer. So the question is: what libraries can I use to solve this problem? I need an API to view a Word document on the client side using Java.
You cannot reliably display a Word document in a web page using Java (or any other simple technology for that matter). There are several commercial libraries out there to render Word, but you will not find these to be easy, cheap or reliable solutions.
What you should do is the following:
(1) Open the Word engine on the server using a .NET program
(2) Convert the document to Rich Text using the Word engine
(3) Display the rich text either using the RTF Swing widget, or convert to HTML:
String rtf = [your document rich text];
BufferedReader input = new BufferedReader(new StringReader(rtf));
RTFEditorKit rtfKit = new RTFEditorKit();
StyledDocument doc = (StyledDocument) rtfKit.createDefaultDocument();
rtfKit.read( input, doc, 0 );
input.close();
HTMLEditorKit htmlKit = new HTMLEditorKit();
StringWriter output = new StringWriter();
htmlKit.write( output, doc, 0, doc.getLength());
String html = output.toString();
The main risk in this approach is that the Word engine will either crash or have a memory leak. For this reason you have to have a mechanism for restarting it periodically and testing it to make sure it is functional and not hogging memory.
docx4all is a Swing-based applet which does Word 2007 XML (ie not Word 2003 XML), which we wrote several years ago.
Get it from svn.
That's a possible approach for editing. If all you want is a viewer, why not convert to HTML or PDF? You can use docx4j for that. (Disclosure: "my" project).
You might have a look at the Apache POI - Java API to Handle Microsoft Word Files which is able to read all kinds of word documents (OLE2 and OOXML formats, .doc and .docx extensions respectively).
Reading a .doc file can be as easy as:
import java.io.*;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class ReadDocFile {
    public static void main(String[] args) {
        File file = new File("c:\\New.doc");
        // Close the stream automatically once the text has been extracted
        try (FileInputStream fis = new FileInputStream(file.getAbsolutePath())) {
            HWPFDocument document = new HWPFDocument(fis);
            WordExtractor extractor = new WordExtractor(document);
            String[] fileData = extractor.getParagraphText();
            for (int i = 0; i < fileData.length; i++) {
                if (fileData[i] != null)
                    System.out.println(fileData[i]);
            }
        } catch (Exception exep) {
            // Report the failure instead of silently swallowing it
            exep.printStackTrace();
        }
    }
}
You can find more at: HWPF Quick-Guide (specifically HWPF unit tests)
Note that, according to the POI site:
HWPF is still in early development.
I'd suggest looking at the OpenOffice source code and implementing that.
It's supposed to be written in Java.
