Error in read .doc and .docx file's content

I want to read .txt, .doc and .docx files and print their contents. When I run the code below, some .doc and .txt files are read, but many files cannot be read.
import java.io.File;
import javax.swing.*;
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileReader;

public class FindYourDocx
{
    public static void main(String[] args)
    {
        String text = "";
        int read, N = 1024 * 1024;
        char[] buffer = new char[N];
        try {
            JFileChooser openFile = new JFileChooser();
            openFile.setCurrentDirectory(new File("."));
            openFile.showOpenDialog(null);
            File f1 = openFile.getSelectedFile();
            String file1 = f1.toString();
            File f = new File(file1);
            JOptionPane.showMessageDialog(null, f);
            FileReader fr = new FileReader(f);
            BufferedReader br = new BufferedReader(fr);
            while (true) {
                read = br.read(buffer, 0, N);
                text += new String(buffer, 0, read);
                System.out.println("Follows" + text + " ");
                if (read < N) {
                    break;
                }
                System.out.println("Follows" + text + " ");
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
When I execute the above code, for some files I get weird output like the following:
http://i.stack.imgur.com/RwNWM.jpg
Could someone please help me solve this issue?
To read .docx I came across something like XWPFDocument using apacheio. What is this?

First of all, you should think about your problem: what do the different file types look like as files, what is their structure, which content would you like to print, and what does "printing" mean at all? What you are doing is reading files, treating them as text and printing them to STDOUT. Is that what "printing" means in your case? I interpret "printing" as being able to send content to a printer and get some paper.
Another hint: .doc and .docx are binary files which contain "printable" text "somewhere". You can't just read the files and do something with the data. You need to know what those file formats look like, where the content is, and so on. Java can't do that out of the box; you need additional libraries to parse those file formats and do something with them.
There are many tutorials and questions around formats like docx:
How to read docx file content in java api using poi jar

To read .docx I came across something like XWPFDocument using apacheio. What is this?
You mean Apache POI. To find out more, check the website. In brief, both Apache POI and docx4j (which I note you have tagged) are Java libraries aimed at reading, manipulating, and writing Microsoft Office files.
'doc' files are Microsoft proprietary binary files. If you try to read them in and display them using the Java IO API alone, all you will see is a representation of the binary data. It won't be useful to you. You need to use an API specifically for loading up and traversing Word files, which is where Apache POI or docx4j come in.
'docx' files are a newer XML-based Microsoft Office format. A docx file is essentially a zipped folder containing the various assets that make up a Word file.
As I said, in order to read a Word file properly, you will need to use one of the libraries mentioned. Both the Apache and docx4j websites contain plenty of example code to get you started opening and traversing Word documents (note that POI can work with the older .doc format, whereas docx4j is only for .docx files).
http://www.docx4java.org
http://poi.apache.org
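To make the "zipped folder" point concrete, here is a minimal sketch that uses only the JDK. It assumes the main document part is stored as word/document.xml (the usual location) and pulls out the text runs (w:t elements) with a regular expression. The class and helper names are made up for illustration, XML entities are not decoded, and paragraph breaks are ignored, so for real work one of the libraries above is the better choice.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class DocxZipSketch {
    // Matches <w:t>...</w:t> and <w:t xml:space="preserve">...</w:t>, but not <w:tab/> or <w:tbl>.
    private static final Pattern TEXT_RUN = Pattern.compile("<w:t(?:\\s[^>]*)?>([^<]*)</w:t>");

    /** Returns the concatenated text runs of word/document.xml, or "" if that entry is missing. */
    public static String extractText(byte[] docxBytes) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(docxBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"word/document.xml".equals(entry.getName())) {
                    continue;
                }
                // Read the (decompressed) XML of the current entry.
                ByteArrayOutputStream xmlBytes = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                int n;
                while ((n = zip.read(buf)) != -1) {
                    xmlBytes.write(buf, 0, n);
                }
                String xml = new String(xmlBytes.toByteArray(), StandardCharsets.UTF_8);
                StringBuilder text = new StringBuilder();
                Matcher m = TEXT_RUN.matcher(xml);
                while (m.find()) {
                    text.append(m.group(1));
                }
                return text.toString();
            }
            return "";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Hypothetical helper: a tiny zip holding only word/document.xml (not a complete .docx). */
    public static byte[] sampleZip(String bodyText) {
        String xml = "<w:document><w:body><w:p><w:r><w:t>" + bodyText
                + "</w:t></w:r></w:p></w:body></w:document>";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(extractText(sampleZip("Hello from a docx")));
    }
}
```

For a real file, the bytes could come from java.nio.file.Files.readAllBytes. The same approach does not work for .doc, which is not a zip archive.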

Related

how to know whether a file is .docx or .doc format from Apache POI

I know we can get it done by extension or by mime type, do we have any other way through which we can get the idea of type of file whether it is .docx or .doc.

If it is just a matter of deciding between .doc and .docx for a collection of files that are known to be one or the other but are not marked with an extension, you can use the fact that a .docx file is a zipped collection of files. Something along the following lines might help:
boolean isZip = new ZipInputStream( fileStream ).getNextEntry() != null;
where fileStream is whatever file or other input stream you wish to evaluate. You could further evaluate a zipped file by looking for key .docx entries. A good starting reference is Word Document (DOCX). Likewise, if you know it is just a binary file, you can test for Word's File Information Block (see Word (.doc) Binary File Format).
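The "looking for key .docx entries" step could be sketched as follows. This assumes that the presence of word/document.xml is a good enough marker, which holds for typical Word files but is not guaranteed by the format. The class and helper names are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class DocxEntryCheck {
    /** Returns true if the stream is a zip archive that contains word/document.xml. */
    public static boolean looksLikeDocx(InputStream in) {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    return true;
                }
            }
            return false; // not a zip at all, or a zip without the Word main part
        } catch (IOException e) {
            return false; // unreadable or corrupt input is treated as "not a .docx"
        }
    }

    /** Hypothetical helper: builds an in-memory zip with empty entries of the given names. */
    public static byte[] zipWithEntries(String... names) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] docxLike = zipWithEntries("[Content_Types].xml", "word/document.xml");
        byte[] plainZip = zipWithEntries("notes.txt");
        System.out.println(looksLikeDocx(new ByteArrayInputStream(docxLike)));
        System.out.println(looksLikeDocx(new ByteArrayInputStream(plainZip)));
    }
}
```

With a file on disk, pass a FileInputStream instead of the in-memory streams used in main.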

You could use Apache Tika for content detection. But you should be aware that this is a huge framework (with many required dependencies) for such a small task.

There is a way, though not a straightforward one. With Apache POI, you can find out.
Try to read a .docx file using the HWPFDocument class. It will give you the following error:
org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied
data appears to be in the Office 2007+ XML. You are calling the part
of POI that deals with OLE2 Office Documents. You need to call a
different part of POI to process this data (eg XSSF instead of HSSF)
String filePath = "C:\\XXXX\\XXXX.docx";
try (FileInputStream inStream = new FileInputStream(new File(filePath))) {
    // HWPFDocument only understands the old binary .doc format
    HWPFDocument doc = new HWPFDocument(inStream);
    WordExtractor wordExtractor = new WordExtractor(doc);
    System.out.println("Getting words" + wordExtractor.getText());
} catch (Exception e) {
    System.out.print("It's not a .doc format");
}
.docx can be read using XWPFDocument Class.

Why don't you use Apache Tika?
File file = new File("File Here");
Tika tika = new Tika();
// detect(File) returns the detected MIME type as a String
String filetype = tika.detect(file);
System.out.println(filetype);

Assuming you're using Apache POI, you have a few options.
One is to grab the first few bytes of the file and ask POIFSFileSystem with the hasPOIFSHeader(byte[]) method. If you have a stream that supports mark/reset, you can instead use POIFSFileSystem.hasPOIFSHeader(InputStream). If those return true, then try to open it as a .doc with HWPF; otherwise try it as a .docx with XWPF.
Otherwise, if you prefer a try/catch way, try to open it with POIFSFileSystem and catch OfficeXmlFileException. If it opens fine it's a .doc; if you get the exception it's a .docx.
If you look at the source code for WorkbookFactory you'll see the first pattern in use, and you can copy a similar set of logic from that.
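For readers without POI on the classpath, the idea behind the header check can be sketched with plain JDK code. It compares the first bytes of the file with the OLE2 compound file signature (used by .doc) and the zip local file header signature (used by .docx). This is not POI's implementation, and the class and enum names are invented. An OLE2 match only says "old binary Office container" (it could also be .xls or .ppt), and a zip match only says "some zip archive".

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class WordMagicBytes {
    // OLE2 compound file signature, found at the start of legacy .doc/.xls/.ppt files.
    private static final byte[] OLE2_SIGNATURE = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Zip local file header signature "PK\3\4", found at the start of a .docx.
    private static final byte[] ZIP_SIGNATURE = {'P', 'K', 0x03, 0x04};

    public enum Kind { OLE2, ZIP, UNKNOWN }

    /** Classifies a file by the first bytes of its content. */
    public static Kind sniff(byte[] header) {
        if (startsWith(header, OLE2_SIGNATURE)) {
            return Kind.OLE2;
        }
        if (startsWith(header, ZIP_SIGNATURE)) {
            return Kind.ZIP;
        }
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            // No file given: classify a hand-written zip header instead.
            System.out.println(sniff(new byte[] {'P', 'K', 3, 4, 20, 0, 0, 0}));
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            byte[] header = new byte[8];
            int n = in.readNBytes(header, 0, header.length); // requires Java 9 or later
            System.out.println(sniff(Arrays.copyOf(header, n)));
        }
    }
}
```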

Viewing .doc file with java applet

I have a web application. I've generated an MS Word document in XML format (Word 2003 XML Document) on the server side. I need to show this document to a user on the client side using some kind of viewer. So the question is: what libraries can I use to solve this problem? I need an API to view a Word document on the client side using Java.

You cannot reliably display a Word document in a web page using Java (or any other simple technology for that matter). There are several commercial libraries out there to render Word, but you will not find these to be easy, cheap or reliable solutions.
What you should do is the following:
(1) Open the Word engine on the server using a .NET program
(2) Convert the document to Rich Text using the Word engine
(3) Display the rich text either using the RTF Swing widget, or convert to HTML:
String rtf = [your document rich text];
BufferedReader input = new BufferedReader(new StringReader(rtf));
RTFEditorKit rtfKit = new RTFEditorKit();
StyledDocument doc = (StyledDocument) rtfKit.createDefaultDocument();
// parse the RTF into the styled document
rtfKit.read( input, doc, 0 );
input.close();
HTMLEditorKit htmlKit = new HTMLEditorKit();
StringWriter output = new StringWriter();
// serialize the styled document as HTML
htmlKit.write( output, doc, 0, doc.getLength());
String html = output.toString();
The main risk in this approach is that the Word engine will either crash or have a memory leak. For this reason you have to have a mechanism for restarting it periodically and testing it to make sure it is functional and not hogging memory.

docx4all is a Swing-based applet which does Word 2007 XML (ie not Word 2003 XML), which we wrote several years ago.
Get it from svn.
That's a possible approach for editing. If all you want is a viewer, why not convert to HTML or PDF? You can use docx4j for that. (Disclosure: "my" project).

You might have a look at the Apache POI - Java API to Handle Microsoft Word Files which is able to read all kinds of word documents (OLE2 and OOXML formats, .doc and .docx extensions respectively).
Reading a .doc file can be as easy as:
import java.io.*;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class ReadDocFile {
    public static void main(String[] args) {
        File file = null;
        WordExtractor extractor = null;
        try {
            file = new File("c:\\New.doc");
            FileInputStream fis = new FileInputStream(file.getAbsolutePath());
            HWPFDocument document = new HWPFDocument(fis);
            extractor = new WordExtractor(document);
            // one array element per paragraph
            String[] fileData = extractor.getParagraphText();
            for (int i = 0; i < fileData.length; i++) {
                if (fileData[i] != null)
                    System.out.println(fileData[i]);
            }
        }
        catch (Exception exep) {
            // report the failure instead of silently swallowing it
            exep.printStackTrace();
        }
    }
}
You can find more at: HWPF Quick-Guide (specifically HWPF unit tests)
Note that, according to the POI site:
HWPF is still in early development.

I'd suggest looking at the OpenOffice source code and implementing that.
It's supposed to be written in java.

reading/extracting a self extracting zip in JAVA

I was trying to read a self-extracting zip (located here ftp://ftp.dnr.state.oh.us/OilGas/Download/Production/By_Year/2010Production.exe) using java code.
I tried three approaches. The first is the one mentioned at How can I read from a Winzip self-extracting (exe) zip file in Java?
The second was to download the exe file, rename it to .zip (I thought the cheat might work) and then try to read it. Neither of them worked.
The final one used the 7-Zip LZMA SDK, which did not help either.
I also looked at several other resources on the Internet but found nothing useful. Can someone please help me?

TrueZip works best in this case (at least in my case).
A self-extracting zip has the format code1 header1 file1 (while a normal zip has the format header1 file1). The code tells how to extract the zip.
The TrueZip extracting utility does complain about the extra bytes and throws an exception, though.
Here is the code:
private boolean Extract(String src, String dst, String incPath) {
    TFile srcFile = new TFile(src, incPath);
    TFile dstFile = new TFile(dst);
    try {
        TFile.cp_rp(srcFile, dstFile, TArchiveDetector.NULL);
    } catch (IOException e) {
        // TrueZip complains about the extra bytes in front of the zip data;
        // the exception is deliberately ignored here
        return true;
    }
    return true;
}
You can call this method like Extract("C:\\2006Production.exe", "c:\\", "");
You can download the Truezip source files package (jar) from here http://repo1.maven.org/maven2/de/schlichtherle/truezip/truezip-samples/7.5.5/truezip-samples-7.5.5-jar-with-dependencies.jar
You will need to import the classes in your code.
import java.io.IOException;
import de.schlichtherle.truezip.file.TArchiveDetector;
import de.schlichtherle.truezip.file.TFile;
The file is extracted in the c drive...you can perform your own operation on your file. I hope this helps.
Thanks.
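The "code1 header1 file1" layout described above can also be illustrated with the JDK alone. The sketch below searches for the first zip local file header signature (PK\3\4) and hands everything from that offset to ZipInputStream. It is an assumption-laden illustration, not what TrueZip does internally: the class and helper names are invented, and an extractor stub that happens to contain the same four bytes would fool it. It has not been tried against the file from the question.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class SfxZipSketch {
    /** Returns the index of the first "PK\3\4" signature, or -1 if there is none. */
    public static int findZipStart(byte[] data) {
        for (int i = 0; i + 3 < data.length; i++) {
            if (data[i] == 'P' && data[i + 1] == 'K' && data[i + 2] == 3 && data[i + 3] == 4) {
                return i;
            }
        }
        return -1;
    }

    /** Lists the entry names of the zip data that follows the executable stub. */
    public static List<String> listEntries(byte[] sfx) {
        List<String> names = new ArrayList<>();
        int start = findZipStart(sfx);
        if (start < 0) {
            return names;
        }
        // Skip the stub ("code1") and read the rest as an ordinary zip stream.
        try (ZipInputStream zip = new ZipInputStream(
                new ByteArrayInputStream(sfx, start, sfx.length - start))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    /** Hypothetical helper: some fake stub bytes followed by a real one-entry zip. */
    public static byte[] fakeSfx(String entryName, String content) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        byte[] stub = "MZ-pretend-this-is-extractor-code".getBytes(StandardCharsets.US_ASCII);
        bytes.write(stub, 0, stub.length);
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(content.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(listEntries(fakeSfx("production.csv", "a,b,c")));
    }
}
```

To extract rather than list, each entry's bytes can be read from the ZipInputStream inside the loop and written to a file.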

Apache Commons Compress supports this.

Java library for reading Word documents

Is there an open-source Java library for reading Word documents (both .docx and the older .doc format)?
Read-only access is sufficient; I do not need to modify the Word documents using Java. However, I would like to have access to images and style information.
EDIT
I've checked out Apache POI, but it doesn't look like it is being actively maintained. See http://poi.apache.org/hwpf/index.html:
At the moment we unfortunately do not have someone taking care for HWPF and fostering its development.

Apache POI HWPF for .doc and XWPF for .docx files

There is an apache project that does this: http://poi.apache.org//

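import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.xmlbeans.XmlException;

public class XParseTest
{
    public static void main(String[] args) throws XmlException, OpenXML4JException, IOException
    {
        File file = new File("e:\\testing\\new.docx");
        FileInputStream fs = new FileInputStream(file);
        // open the .docx as an OPC (zip) package and extract its plain text
        OPCPackage d = OPCPackage.open(fs);
        XWPFWordExtractor xw = new XWPFWordExtractor(d);
        System.out.println(xw.getText());
    }
}
This will parse a .docx file.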

Parsing a CSV File In JSP

I am creating a web app that will download some data from Yahoo Finance into a CSV file and then (hopefully) read the created CSV data into an HTML table.
I have successfully got the program to connect to the Yahoo feed and download the data into a CSV file, and would now like to use the data from the file in a table.
Below is the code I used to create the CSV file:
String ticker = request.getParameter("stockSym");
URL url = new URL("http://finance.yahoo.com/d/quotes.csv?s=" + ticker + "&f=abc");
InputStream in = url.openStream();
BufferedInputStream bufIn = new BufferedInputStream(in);
File f = new File("stockInfo.csv");
FileOutputStream fop = new FileOutputStream(f);
for (;;)
{
    int data = bufIn.read();
    // Check for EOF
    if (data == -1)
        break;
    else
        fop.write((char) data);
}
fop.flush();
fop.close();
Are there any JSP programmers who would know how to open and then parse a CSV file into a table or would know of any good links to tutorials on how to accomplish this task?
Thanks.

Do not write the CSV parsing code yourself, use one of the many available libraries.
They can handle tricky things like line breaks, quotes, embedded commas and so on.
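To illustrate why, the sketch below compares String.split(",") with a small quote-aware splitter on a line that contains an embedded comma and escaped quotes. The splitter is only a demonstration with invented names: it handles a single line and cannot deal with line breaks inside quoted fields, which is one more reason to prefer a library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CsvPitfall {
    /** Splits one CSV line, honouring double-quoted fields and "" as an escaped quote. */
    public static List<String> splitLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // "" inside quotes is a literal quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        String line = "\"Acme, Inc.\",12.5,\"say \"\"hi\"\"\"";
        // The naive split cuts the company name in two and keeps the quote characters.
        System.out.println(Arrays.asList(line.split(",")));
        // The quote-aware version yields three clean fields.
        System.out.println(splitLine(line));
    }
}
```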

There is an open source routine to parse CSV files here:
http://scm.opendap.org:8090/trac/browser/trunk/ODC/src/opendap/clients/odc/Utility.java
This utility class has a bunch of different functionality in it. Just pull out what is needed to parse the CSV.
