Why did I fail to read an Excel 2007 file using POI? - java

When I try to initialize a Workbook object I always get this error:
The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents. You need to call a different part of POI to process this data (eg XSSF instead of HSSF)
But I followed the official sample. This is my code:
File inputFile = new File(inputFileName);
InputStream is = new FileInputStream(inputFile);
Workbook wb = new XSSFWorkbook(is);
The exception occurs at this line:
Workbook wb = new XSSFWorkbook(is);
These are the POI jars I am including:
poi-3.8-20120326.jar
poi-ooxml-3.8-20120326.jar
poi-ooxml-schemas-3.8-20120326.jar
xmlbeans-2.3.0.jar
Can anyone give me guidance? An example showing how to read a complete Excel 2007 document would be appreciated. Thanks in advance!

I assume you have double-checked that your original file is indeed in the Office 2007+ XML format, right?
Edit:
Then, if you are sure the format is OK and it works when you use WorkbookFactory.create, you can find the answer in the code of that method:
/**
* Creates the appropriate HSSFWorkbook / XSSFWorkbook from
* the given InputStream.
* Your input stream MUST either support mark/reset, or
* be wrapped as a {@link PushbackInputStream}!
*/
public static Workbook create(InputStream inp) throws IOException, InvalidFormatException {
// If clearly doesn't do mark/reset, wrap up
if(! inp.markSupported()) {
inp = new PushbackInputStream(inp, 8);
}
if(POIFSFileSystem.hasPOIFSHeader(inp)) {
return new HSSFWorkbook(inp);
}
if(POIXMLDocument.hasOOXMLHeader(inp)) {
return new XSSFWorkbook(OPCPackage.open(inp));
}
throw new IllegalArgumentException("Your InputStream was neither an OLE2 stream, nor an OOXML stream");
}
This is the bit that you were missing: new XSSFWorkbook(OPCPackage.open(inp))
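To make the header check in that method more concrete, here is a minimal JDK-only sketch of the same idea: peek at the first bytes of the stream, then push them back so the stream can still be read from the start. It is not POI's actual implementation. The class and method names (FormatSniffer, sniff) are hypothetical, and it relies only on the well-known file signatures: OLE2 documents start with D0 CF 11 E0 A1 B1 1A E1, and OOXML documents are ZIP archives starting with "PK" 03 04.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;

public class FormatSniffer {
    // Signature of an OLE2 compound document (.xls, .doc, .ppt).
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // Local file header signature of a ZIP archive (.xlsx, .docx, .pptx).
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Peeks at up to 8 bytes and pushes them back; the stream needs a pushback buffer of at least 8. */
    public static String sniff(PushbackInputStream in) throws IOException {
        byte[] header = new byte[8];
        int read = 0;
        while (read < header.length) {
            int n = in.read(header, read, header.length - read);
            if (n < 0) {
                break; // stream is shorter than 8 bytes
            }
            read += n;
        }
        if (read > 0) {
            in.unread(header, 0, read);
        }
        if (startsWith(header, read, OLE2_MAGIC)) {
            return "OLE2";
        }
        if (startsWith(header, read, ZIP_MAGIC)) {
            return "ZIP"; // a ZIP container, which is what OOXML files are
        }
        return "UNKNOWN";
    }

    private static boolean startsWith(byte[] data, int length, int[] magic) {
        if (length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] zipLike = {0x50, 0x4B, 0x03, 0x04, 0, 0, 0, 0};
        PushbackInputStream in = new PushbackInputStream(new ByteArrayInputStream(zipLike), 8);
        System.out.println(sniff(in));
        // The first byte is still available after sniffing.
        System.out.println(in.read() == 0x50);
    }
}
```

The pushback buffer size of 8 mirrors the `new PushbackInputStream(inp, 8)` call in the quoted method.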


POI reading Excel file with body in String

Currently I am trying to read an Excel file that is polled via Apache Camel (2.25.1).
This means the method gets the file contents via a String:
@Handler
public void processFile(@Body String body) {
For reading the Excel file I use Apache POI and POI-ooxml (both 4.1.2).
However, using the String directly
WorkbookFactory.create(new ByteArrayInputStream(body.getBytes(Charset.forName("UTF-8"))))
throws a "java.io.IOException: ZIP entry size is too large or invalid".
Using the String with the platform default encoding:
WorkbookFactory.create(new ByteArrayInputStream(body.getBytes()))
throws "org.apache.poi.openxml4j.exceptions.NotOfficeXmlFileException: No valid entries or contents found, this is not a valid OOXML (Office Open XML) file".
Besides, I tried:
File file = exchange.getIn().getBody(File.class);
Workbook workbook = new XSSFWorkbook(new FileInputStream(file));
Probably because the file is read from an FTP-server, a java.io.FileNotFoundException is thrown: Invalid file path
However, the next code does work:
URL url = new URL(fileFtpPath);
URLConnection urlc = url.openConnection();
InputStream ftpIs = urlc.getInputStream();
Workbook workbook = new XSSFWorkbook(ftpIs);
But I would prefer not to open a connection to the FTP server myself, since Camel has already read the file and the Excel contents I need are available (in String body).
Is there any way to read the contents of the Excel file from the String with Apache POI?
I have my routes in XML, so I use Groovy to process Excel files. Perhaps you will find this helpful:
import org.apache.poi.ss.usermodel.WorkbookFactory
def workbook = WorkbookFactory.create(request.getBody(File.class))
def sheet = workbook.getSheetAt(0)
...
There is another approach, usually used for large Excel files, where we work with a stream. To go this way we have to implement XSSFSheetXMLHandler.SheetContentsHandler from org.apache.poi.xssf.eventusermodel.
You can find a copy of the original POI example in this SO question; for some reason it was recently deleted from the POI svn. If you are interested, my Groovy version looks like this:
import org.apache.poi.openxml4j.opc.OPCPackage
import org.apache.poi.ooxml.util.SAXHelper
import org.apache.poi.xssf.eventusermodel.XSSFReader
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler
import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable
import org.apache.poi.hssf.usermodel.HSSFDataFormatter
import org.xml.sax.InputSource
class MyHandler implements XSSFSheetXMLHandler.SheetContentsHandler {
...
}
def pkg = OPCPackage.open(request.getBody(InputStream.class))
def xssfReader = new XSSFReader(pkg)
def sheetParser = SAXHelper.newXMLReader()
def handler = new XSSFSheetXMLHandler(xssfReader.getStylesTable(), null, new ReadOnlySharedStringsTable(pkg), new MyHandler(), new HSSFDataFormatter(), false)
sheetParser.setContentHandler(handler)
sheetParser.parse(new InputSource(xssfReader.getSheetsData().next()))
You can convert the body directly into an InputStream and pass it to the XSSFWorkbook constructor:
Exchange exchange = consumerTemplate.receive("file://C:/ftp/?noop=true", pollCount);
InputStream stream = exchange.getIn().getBody(InputStream.class);
XSSFWorkbook workbook = new XSSFWorkbook(stream);
XSSFSheet sheet = workbook.getSheetAt(0);
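A likely explanation for why the String-based attempts fail (an assumption; the thread does not confirm it) is that an .xlsx file is a binary ZIP archive, and decoding arbitrary bytes into a String and encoding them again does not give back the same bytes. The JDK-only sketch below illustrates this with a hypothetical BinaryRoundTrip class and a made-up byte sequence; it does not involve Camel or POI.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BinaryRoundTrip {
    /** Decodes bytes to a String as UTF-8 and encodes them back, as happens when a binary body is handled as text. */
    public static byte[] viaUtf8String(byte[] original) {
        String asText = new String(original, StandardCharsets.UTF_8);
        return asText.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // ZIP signature followed by a few bytes that are not valid UTF-8 (illustrative data only).
        byte[] original = {0x50, 0x4B, 0x03, 0x04, (byte) 0xFF, (byte) 0xFE, (byte) 0x80, 0x00};
        byte[] roundTripped = viaUtf8String(original);
        // Invalid sequences are replaced during decoding, so the arrays differ.
        System.out.println("identical after round trip: " + Arrays.equals(original, roundTripped));
    }
}
```

This is why asking for the body as an InputStream (or byte[]) rather than a String is the safer choice for binary files.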

After extracting .xls ole - file is empty with Microsoft Excel but readable with apache-poi

I have a .doc file with an embedded .xls and an embedded .doc.
I can extract both files and save them.
When I open the extracted .doc document, everything is fine.
When I open the extracted .xls document, it is empty: the editor shows nothing, not even empty cells.
So I read the extracted .xls document again with apache-poi, and when I look at the sheet name or the content of the cells, everything is there.
Do you have any idea what the problem is?
My setup is:
apache-poi version 3.15 (I also tried some minor versions).
The Word and Excel files were created with Office 2007.
The relevant code:
POIFSFileSystem fs = new POIFSFileSystem(file);
POIOLE2TextExtractor poiole2TextExtractor = ExtractorFactory.createExtractor(fs);
POITextExtractor[] embeddedExtractors = ExtractorFactory.getEmbededDocsTextExtractors(poiole2TextExtractor);
for (POITextExtractor textExtractor : embeddedExtractors) {
// If the embedded object was an Excel spreadsheet.
if (textExtractor instanceof ExcelExtractor) {
ExcelExtractor excelExtractor = (ExcelExtractor) textExtractor;
DirectoryNode directoryNode = (DirectoryNode) excelExtractor.getRoot();
HSSFWorkbook hssfWorkbook = new HSSFWorkbook(directoryNode, true);
File tmp = new File(targetfolder, "test.xls");
FileOutputStream fileOutputStream = new FileOutputStream(tmp);
hssfWorkbook.write(fileOutputStream);
fileOutputStream.flush();
fileOutputStream.close();
hssfWorkbook.close();
}
}
Thank you :)
So somehow I found the problem.
For HSSFWorkbook I needed to set the following attribute:
hssfWorkbook.setHidden(false);
For the xlsx (2007) format, calling that method throws a NotImplementedException, so you have to fix it manually. I found the following solution:
String workbookContent = new String(ZipFileUtils.getInnerFile(tmp, "xl/workbook.xml"), "UTF-8");
workbookContent = workbookContent.replaceFirst("visibility=\"hidden\"", "");
ZipFileUtils.replaceZippedFile(tmp, "xl/workbook.xml",
workbookContent.getBytes( "UTF-8"), new FileOutputStream(tmp2));
where tmp is my extracted xlsx file; for the moment I save the result to a new file, tmp2.
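ZipFileUtils in the snippet above is the poster's own helper and is not shown. As a rough idea of what such a helper might do, here is a minimal sketch using only java.util.zip: it copies an archive entry by entry and strips the first visibility="hidden" attribute from one named entry. The class and method names (WorkbookXmlPatcher, unhide, readAll) are hypothetical, it works on in-memory byte arrays for simplicity, and it has not been checked against real workbooks.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class WorkbookXmlPatcher {
    /** Copies a ZIP archive, removing the first visibility="hidden" attribute from the named entry. */
    public static byte[] unhide(byte[] zipBytes, String entryName) throws IOException {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes));
             ZipOutputStream out = new ZipOutputStream(result)) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                byte[] content = readAll(in);
                if (entry.getName().equals(entryName)) {
                    String xml = new String(content, StandardCharsets.UTF_8);
                    xml = xml.replaceFirst("visibility=\"hidden\"", "");
                    content = xml.getBytes(StandardCharsets.UTF_8);
                }
                // A fresh ZipEntry avoids carrying over the old size and CRC.
                out.putNextEntry(new ZipEntry(entry.getName()));
                out.write(content);
                out.closeEntry();
            }
        }
        // The archive is complete only after the ZipOutputStream has been closed.
        return result.toByteArray();
    }

    /** Reads the stream (for a ZipInputStream: the current entry) to its end. */
    public static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny in-memory archive standing in for an extracted .xlsx file.
        ByteArrayOutputStream sample = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(sample)) {
            zip.putNextEntry(new ZipEntry("xl/workbook.xml"));
            zip.write("<workbookView visibility=\"hidden\"/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        byte[] patched = unhide(sample.toByteArray(), "xl/workbook.xml");
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(patched))) {
            in.getNextEntry();
            System.out.println(new String(readAll(in), StandardCharsets.UTF_8));
        }
    }
}
```

Writing to a new output rather than patching in place matches the tmp/tmp2 approach described above.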

Save embedded files from .xls document (Apache POI)

I would like to save all attached files from an Excel (xls/HSSF) without extension.
I've been trying for a long time now, and I really don't know if this is even possible. I also tried Apache Tika, but I don't want to use Tika for this, because I need POI for other tasks, anyway.
I tried the sample code from the Busy Developers' Guide, but it does not extract files in the old Office formats (doc, ppt, xls). It also fails with an error when trying to create new SlideShow(new HSLFSlideShow(dn, fs)). Error: (Remove argument to match HSLFSlideShow(dn))
My actual code is:
public static void saveEmbeddedXLS(InputStream fis_param, String embDIR) throws IOException, InvalidFormatException{
//HSSF - XLS
int i = 0;
System.out.println("Starting Embedded Search in xls...");
POIFSFileSystem fs = new POIFSFileSystem(fis_param);//create FileSystem using fileInputStream
HSSFWorkbook workbook = new HSSFWorkbook(fs);
for (HSSFObjectData obj : workbook.getAllEmbeddedObjects()) {
System.out.println("Objects : "+ obj.getOLE2ClassName());//the OLE2 Class Name of the object
String oleName = obj.getOLE2ClassName();//Document Type
DirectoryNode dn = (DirectoryNode) obj.getDirectory();//get Directory Node
//Trying to create an input stream for the embedded document; the argument of createDocumentInputStream should be a String. Where/how can I get the correct parameter for this function?
InputStream is = dn.createDocumentInputStream(dn);//This line is incorrect! How can I do it correctly?
FileOutputStream fos = new FileOutputStream("embDIR" + i);//Outputfilepath + Number
IOUtils.copy(is, fos);//FileInputStream > FileOutput Stream (save File without extension)
i++;
}
}
So my simple question is:
Is it possible to save ALL attachments from an xls file without any extension (as simply as possible)? And can anyone provide me with a solution? Many thanks!

iText mergeFields in PdfCopy creates invalid pdf

I am working on the task of merging some input PDF documents using iText 5.4.5. The input documents may or may not contain AcroForms and I want to merge the forms as well.
I am using the example pdf files found here and this is the code example:
public class TestForms {
@Test
public void testNoForms() throws DocumentException, IOException {
test("pdf/hello.pdf", "pdf/hello_memory.pdf");
}
@Test
public void testForms() throws DocumentException, IOException {
test("pdf/subscribe.pdf", "pdf/filled_form_1.pdf");
}
private void test(String first, String second) throws DocumentException, IOException {
OutputStream out = new FileOutputStream("/tmp/out.pdf");
InputStream stream = getClass().getClassLoader().getResourceAsStream(first);
PdfReader reader = new PdfReader(new RandomAccessFileOrArray(
new RandomAccessSourceFactory().createSource(stream)), null);
InputStream stream2 = getClass().getClassLoader().getResourceAsStream(second);
PdfReader reader2 = new PdfReader(new RandomAccessFileOrArray(
new RandomAccessSourceFactory().createSource(stream2)), null);
Document pdfDocument = new Document(reader.getPageSizeWithRotation(1));
PdfCopy pdfCopy = new PdfCopy(pdfDocument, out);
pdfCopy.setFullCompression();
pdfCopy.setCompressionLevel(PdfStream.BEST_COMPRESSION);
pdfCopy.setMergeFields();
pdfDocument.open();
pdfCopy.addDocument(reader);
pdfCopy.addDocument(reader2);
pdfCopy.close();
reader.close();
reader2.close();
}
}
With input files containing forms I get a NullPointerException with or without compression enabled.
With standard input docs, the output file is created but when I open it with Acrobat it says there was a problem (14) and no content is displayed.
With standard input docs AND compression disabled the output is created and Acrobat displays it.
Questions
I previously did this using PdfCopyFields, but it is now deprecated in favor of the boolean flag mergeFields in PdfCopy. Is this correct? There is no javadoc on that flag and I couldn't find any documentation about it.
Assuming the answer to the previous question is Yes, is there anything wrong with my code?
Thanks
We are using PdfCopy to merge different files, some of which may have fields. We use version 5.5.3.0. The code is simple and seems to work fine, BUT sometimes the resulting file is impossible to print!
Our code:
Public Shared Function MergeFiles(ByVal sourceFiles As List(Of Byte())) As Byte()
Dim document As New Document()
Dim output As New MemoryStream()
Dim copy As iTextSharp.text.pdf.PdfCopy = Nothing
Dim readers As New List(Of iTextSharp.text.pdf.PdfReader)
Try
copy = New iTextSharp.text.pdf.PdfCopy(document, output)
copy.SetMergeFields()
document.Open()
For fileCounter As Integer = 0 To sourceFiles.Count - 1
Dim reader As New PdfReader(sourceFiles(fileCounter))
reader.MakeRemoteNamedDestinationsLocal()
readers.Add(reader)
copy.AddDocument(reader)
Next
Catch exception As Exception
Throw exception
Finally
If copy IsNot Nothing Then copy.Close()
document.Close()
For Each reader As PdfReader In readers
reader.Close()
Next
End Try
Return output.GetBuffer()
End Function
Your usage of PdfCopy.setMergeFields() is correct and your merging code is fine.
The issues you described are because of bugs that have crept into 5.4.5. They should be fixed in rev. 6152 and the fixes will be included in the next release.
Thanks for bringing this to our attention.
Just to say that we have the same problem: iText mergeFields in PdfCopy creates an invalid PDF. So it is still not fixed in version 5.5.3.0.

Apache POI throwing IOException when reading XLSX workbook

I'm trying to get the following code to run and am getting an IOException:
String cellText = null;
InputStream is = null;
try {
// Find /mydata/myworkbook.xlsx
is = new FileInputStream("/mydata/myworkbook.xlsx");
is.close();
System.out.println("Found the file!");
// Read it in as a workbook and then obtain the "widgets" sheet.
Workbook wb = new XSSFWorkbook(is);
Sheet sheet = wb.getSheet("widgets");
System.out.println("Obtained the widgets sheet!");
// Grab the 2nd row in the sheet (that contains the data we want).
Row row = sheet.getRow(1);
// Grab the 7th cell/col in the row (containing the Plot 500 English Description).
Cell cell = row.getCell(6);
cellText = cell.getStringCellValue();
System.out.println("Cell text is: " + cellText);
} catch(Throwable throwable) {
System.err.println(throwable.getMessage());
} finally {
if(is != null) {
try {
is.close();
} catch(IOException ioexc) {
ioexc.printStackTrace();
}
}
}
The output from running this in Eclipse is:
Found the file!
Stream Closed
java.io.IOException: Stream Closed
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:236)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at java.io.PushbackInputStream.read(PushbackInputStream.java:186)
at java.util.zip.ZipInputStream.readFully(ZipInputStream.java:414)
at java.util.zip.ZipInputStream.readLOC(ZipInputStream.java:247)
at java.util.zip.ZipInputStream.getNextEntry(ZipInputStream.java:91)
at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:51)
at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:83)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:267)
at org.apache.poi.util.PackageHelper.open(PackageHelper.java:39)
at org.apache.poi.xssf.usermodel.XSSFWorkbook.<init>(XSSFWorkbook.java:204)
at me.myorg.MyAppRunner.run(MyAppRunner.java:39)
at me.myorg.MyAppRunner.main(MyAppRunner.java:25)
The exception is coming from the line:
Workbook wb = new XSSFWorkbook(is);
According to the XSSFWorkbook Java Docs this is a valid constructor for an XSSFWorkbook object, and I don't see anything "jumping out" at me to indicate that I'm using my InputStream incorrectly. Can any POI gurus help spot where I'm going awry? Thanks in advance.
The problem is simple:
is = new FileInputStream("/mydata/myworkbook.xlsx");
is.close();
You are closing your input stream before passing it to the constructor, so it cannot be read.
Simply delete the is.close() here to fix the issue; the stream will be cleaned up in the finally block at the end.
You are closing the stream with is.close();
and then using it. Don't close it until you have finished using it.
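The failure can be reproduced without POI. The sketch below (a hypothetical ClosedStreamDemo class working on a temporary file) shows that any read on a FileInputStream after close() throws an IOException, and that try-with-resources is one way to make sure the stream is closed only after it has been used.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class ClosedStreamDemo {
    /** Returns true if reading from an already-closed FileInputStream throws an IOException. */
    public static boolean readAfterCloseFails(Path file) throws IOException {
        InputStream is = new FileInputStream(file.toFile());
        is.close(); // same mistake as in the question: closed before use
        try {
            is.read();
            return false;
        } catch (IOException expected) {
            return true;
        }
    }

    /** Reads the first byte; try-with-resources closes the stream only after the read. */
    public static int firstByte(Path file) throws IOException {
        try (InputStream is = new FileInputStream(file.toFile())) {
            return is.read();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".bin");
        try {
            Files.write(tmp, new byte[] {42});
            System.out.println(readAfterCloseFails(tmp));
            System.out.println(firstByte(tmp));
        } finally {
            Files.delete(tmp);
        }
    }
}
```

In the question's code, the equivalent change would be to construct the workbook inside the try block before any close() call.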
As the others have pointed out, you are closing your InputStream, which is breaking things.
However, you really shouldn't be using an InputStream in the first place! POI uses less memory when given the File object directly rather than going through an InputStream.
I'd suggest you have a read through the POI FAQ on File vs InputStream, then change your code to be:
OPCPackage pkg = OPCPackage.open(new File("/mydata/myworkbook.xlsx"));
Workbook wb = new XSSFWorkbook(pkg);
