Save embedded files from .xls document (Apache POI)

I would like to save all files embedded in an Excel workbook (xls/HSSF), writing them out without a file extension.
I've been trying for a long time now, and I really don't know whether this is even possible. I also tried Apache Tika, but I don't want to use Tika for this because I need POI for other tasks anyway.
I tried the sample code from the Busy Developers' Guide, but it does not extract files in the old Office formats (doc, ppt, xls). It also reports an error at new SlideShow(new HSLFSlideShow(dn, fs)). Error: (Remove argument to match HSLFSlideShow(dn))
My actual code is:
public static void saveEmbeddedXLS(InputStream fis_param, String embDIR) throws IOException, InvalidFormatException{
//HSSF - XLS
int i = 0;
System.out.println("Starting Embedded Search in xls...");
POIFSFileSystem fs = new POIFSFileSystem(fis_param);//create FileSystem using fileInputStream
HSSFWorkbook workbook = new HSSFWorkbook(fs);
for (HSSFObjectData obj : workbook.getAllEmbeddedObjects()) {
System.out.println("Objects : "+ obj.getOLE2ClassName());//the OLE2 Class Name of the object
String oleName = obj.getOLE2ClassName();//Document Type
DirectoryNode dn = (DirectoryNode) obj.getDirectory();//get Directory Node
//Trying to create an input Stream with the embedded document, argument of createDocumentInputStream should be: String; Where/How can I get this correct parameter for the function?
InputStream is = dn.createDocumentInputStream(dn);//This line is incorrect! How can I do it correctly?
FileOutputStream fos = new FileOutputStream(embDIR + i);//Output file path + number
IOUtils.copy(is, fos);//FileInputStream > FileOutput Stream (save File without extension)
i++;
}
}
So my simple question is:
Is it possible to save ALL attachments from an xls file without any extension (as simple as possible)? And can any one provide me a solution? Many Thanks!

Related

POI reading Excel file with body in String

Currently I am trying to read an Excel file that is polled via Apache Camel (2.25.1).
This means the method gets the file contents as a String:
@Handler
public void processFile(@Body String body) {
For reading the Excel file I use Apache POI and POI-ooxml (both 4.1.2).
However, using the String directly
WorkbookFactory.create(new ByteArrayInputStream(body.getBytes(Charset.forName("UTF-8"))))
throws a "java.io.IOException: ZIP entry size is too large or invalid".
Using the String with other encodings, for example the platform default:
WorkbookFactory.create(new ByteArrayInputStream(body.getBytes()))
throws "org.apache.poi.openxml4j.exceptions.NotOfficeXmlFileException: No valid entries or contents found, this is not a valid OOXML (Office Open XML) file".
Besides, I tried:
File file = exchange.getIn().getBody(File.class);
Workbook workbook = new XSSFWorkbook(new FileInputStream(file));
Probably because the file is read from an FTP-server, a java.io.FileNotFoundException is thrown: Invalid file path
However, the next code does work:
URL url = new URL(fileFtpPath);
URLConnection urlc = url.openConnection();
InputStream ftpIs = urlc.getInputStream();
Workbook workbook = new XSSFWorkbook(ftpIs);
But I prefer not making a connection to the FTP server myself, since Camel has already read the file and the needed Excel contents are available (in String body).
Is there any way to read the contents of the Excel file from the String with Apache POI?

I have my routes in XML, so I use groovy to process excel files, perhaps you may find it helpful
import org.apache.poi.ss.usermodel.WorkbookFactory
def workbook = WorkbookFactory.create(request.getBody(File.class))
def sheet = workbook.getSheetAt(0)
...
Another approach, usually used for large Excel files, works on a stream. To go this way, implement XSSFSheetXMLHandler.SheetContentsHandler from org.apache.poi.xssf.eventusermodel.
You can find a copy of the original POI example in this SO question; for some reason it was recently deleted from the POI svn. If you are interested, my Groovy version looks like this:
import org.apache.poi.openxml4j.opc.OPCPackage
import org.apache.poi.ooxml.util.SAXHelper
import org.apache.poi.xssf.eventusermodel.XSSFReader
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler
import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable
import org.apache.poi.hssf.usermodel.HSSFDataFormatter
import org.xml.sax.InputSource
class MyHandler implements XSSFSheetXMLHandler.SheetContentsHandler {
...
}
def pkg = OPCPackage.open(request.getBody(InputStream.class))
def xssfReader = new XSSFReader(pkg)
def sheetParser = SAXHelper.newXMLReader()
def handler = new XSSFSheetXMLHandler(xssfReader.getStylesTable(), null, new ReadOnlySharedStringsTable(pkg), new MyHandler(), new HSSFDataFormatter(), false)
sheetParser.setContentHandler(handler)
sheetParser.parse(new InputSource(xssfReader.getSheetsData().next()))

You can convert the body directly into an InputStream and pass it to the XSSFWorkbook constructor:
Exchange exchange = consumerTemplate.receive("file://C:/ftp/?noop=true", pollCount);
InputStream stream = exchange.getIn().getBody(InputStream.class);
XSSFWorkbook workbook = new XSSFWorkbook(stream);
XSSFSheet sheet = workbook.getSheetAt(0);
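A note on why the String body fails, as a dependency-free sketch: an .xlsx file is a zip archive, so it is binary data. Decoding arbitrary bytes as UTF-8 text replaces every invalid sequence, so getBytes() can no longer return the original bytes. That is consistent with the ZIP errors in the question and with the InputStream-based answers working. The class and method names below are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class BinaryStringRoundTrip {

    /** Returns true if the bytes are unchanged after byte[] -> String -> byte[] through UTF-8. */
    static boolean survivesUtf8(byte[] original) {
        String asText = new String(original, StandardCharsets.UTF_8);
        byte[] back = asText.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(original, back);
    }

    public static void main(String[] args) {
        byte[] ascii = "plain text".getBytes(StandardCharsets.US_ASCII);
        // ZIP local-file-header signature followed by bytes that are not valid UTF-8.
        byte[] binary = {0x50, 0x4B, 0x03, 0x04, (byte) 0xFF, (byte) 0xFE, (byte) 0x80};

        System.out.println(survivesUtf8(ascii));  // prints "true"
        System.out.println(survivesUtf8(binary)); // prints "false"
    }
}
```

So the fix is to ask for the body as bytes or as a stream before any text conversion happens, which is what getBody(InputStream.class) does.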

After extracting .xls ole - file is empty with Microsoft Excel but readable with apache-poi

I have a .doc file with an embedded .xls and an embedded .doc.
I can extract both files and save them.
The extracted .doc opens fine.
The extracted .xls, however, appears empty in Excel: the editor shows nothing, not even empty cells.
When I read the extracted .xls again with Apache POI and look at the sheet names or cell contents, everything is there.
Do you have any idea what causes this?
My setup:
Apache POI version 3.15 (I also tried some other minor versions)
The Word and Excel files were created with Office 2007.
The relevant code:
POIFSFileSystem fs = new POIFSFileSystem(file);
POIOLE2TextExtractor poiole2TextExtractor = ExtractorFactory.createExtractor(fs);
POITextExtractor[] embeddedExtractors = ExtractorFactory.getEmbededDocsTextExtractors(poiole2TextExtractor);
for (POITextExtractor textExtractor : embeddedExtractors) {
// If the embedded object was an Excel spreadsheet.
if (textExtractor instanceof ExcelExtractor) {
ExcelExtractor excelExtractor = (ExcelExtractor) textExtractor;
DirectoryNode directoryNode = (DirectoryNode) excelExtractor.getRoot();
HSSFWorkbook hssfWorkbook = new HSSFWorkbook(directoryNode, true);
File tmp = new File(targetfolder, "test.xls");
FileOutputStream fileOutputStream = new FileOutputStream(tmp);
hssfWorkbook.write(fileOutputStream);
fileOutputStream.flush();
fileOutputStream.close();
hssfWorkbook.close();
}
}
Thank you :)

So somehow I found the problem:
For HSSFWorkbook I needed to set the following attribute:
hssfWorkbook.setHidden(false);
For xlsx (2007) files, calling that method throws a NotImplementedException, so you have to fix it manually. I found the following solution:
String workbookContent = new String(ZipFileUtils.getInnerFile(tmp, "xl/workbook.xml"), "UTF-8");
workbookContent = workbookContent.replaceFirst("visibility=\"hidden\"", "");
ZipFileUtils.replaceZippedFile(tmp, "xl/workbook.xml",
workbookContent.getBytes( "UTF-8"), new FileOutputStream(tmp2));
where tmp is my extracted xlsx file and tmp2 is the new file that the result is saved to at the moment.
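ZipFileUtils above is not a JDK or POI class, so it is presumably a helper of the author's own. As a rough sketch of the same idea using only java.util.zip, the following copies a zip archive and passes the text of one entry through an edit function. The class and method names are invented for this example, and it works on in-memory byte arrays, which assumes the workbook is small enough to hold in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.UnaryOperator;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipEntryRewriter {

    /** Reads the stream to its end; for a ZipInputStream that is the end of the current entry. */
    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    /** Copies every entry unchanged, except entryName, whose UTF-8 text is passed through edit. */
    static byte[] rewriteEntry(byte[] zip, String entryName, UnaryOperator<String> edit) throws IOException {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zip));
             ZipOutputStream out = new ZipOutputStream(result)) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                byte[] content = readAll(in);
                if (entry.getName().equals(entryName)) {
                    String edited = edit.apply(new String(content, StandardCharsets.UTF_8));
                    content = edited.getBytes(StandardCharsets.UTF_8);
                }
                out.putNextEntry(new ZipEntry(entry.getName()));
                out.write(content);
                out.closeEntry();
            }
        }
        return result.toByteArray();
    }

    /** Returns the UTF-8 text of entryName, or null if the archive has no such entry. */
    static String readEntry(byte[] zip, String entryName) throws IOException {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zip))) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                if (entry.getName().equals(entryName)) {
                    return new String(readAll(in), StandardCharsets.UTF_8);
                }
            }
        }
        return null;
    }
}
```

With this sketch, the fix above would be a call such as rewriteEntry(xlsxBytes, "xl/workbook.xml", s -> s.replaceFirst("visibility=\"hidden\"", "")).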

Robotium : I want to Do data driven Testing using robotium?

I am looking for a way to test login without writing every case with hard-coded values. QTP and similar tools support data-driven testing (parameterization): the tool fetches data from a file and keeps entering it until all inputs have been executed.
Is this possible with Robotium?
Please let me know.

Yes, you can put the data in an Excel file and fetch it from there.
You can use:
File storage = Environment.getExternalStorageDirectory();
String filename = "test.xls";
File myfile = new File(storage, filename);
FileInputStream fis = new FileInputStream(myfile);
Workbook ws = null;
ws = Workbook.getWorkbook(fis);
Sheet Wsheet = ws.getSheet(SheetNum);
String returnValue ;
returnValue = Wsheet.getCell(ColmnNum, RowNum).getContents();
This is the same way Excel is read in QTP. It works for me.

Why do I failed to read Excel 2007 using POI?

When I try to initialize a Workbook object I always get this error:
The supplied data appears to be in the Office 2007+ XML. You are calling the part of POI that deals with OLE2 Office Documents. You need to call a different part of POI to process this data (eg XSSF instead of HSSF)
But I followed the official sample to do this. This is my code:
File inputFile = new File(inputFileName);
InputStream is = new FileInputStream(inputFile);
Workbook wb = new XSSFWorkbook(is);
The exception occurs at this line:
Workbook wb = new XSSFWorkbook(is);
These are the POI jars included:
poi-3.8-20120326.jar
poi-ooxml-3.8-20120326.jar
poi-ooxml-schemas-3.8-20120326.jar
xmlbeans-2.3.0.jar
Can anyone give me guidance? An example showing how to read a complete Excel 2007 document would be appreciated. Thanks in advance!

I assume that you have rechecked that your original file is indeed in Office 2007+ XML format, right?
Edit:
Then, if you are sure the format is OK and it works for you using WorkbookFactory.create, you can find the answer in the code of that method:
/**
* Creates the appropriate HSSFWorkbook / XSSFWorkbook from
* the given InputStream.
* Your input stream MUST either support mark/reset, or
* be wrapped as a {@link PushbackInputStream}!
*/
public static Workbook create(InputStream inp) throws IOException, InvalidFormatException {
// If clearly doesn't do mark/reset, wrap up
if(! inp.markSupported()) {
inp = new PushbackInputStream(inp, 8);
}
if(POIFSFileSystem.hasPOIFSHeader(inp)) {
return new HSSFWorkbook(inp);
}
if(POIXMLDocument.hasOOXMLHeader(inp)) {
return new XSSFWorkbook(OPCPackage.open(inp));
}
throw new IllegalArgumentException("Your InputStream was neither an OLE2 stream, nor an OOXML stream");
}
This is the bit that you were missing: new XSSFWorkbook(OPCPackage.open(inp))
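To double-check which container format a file really uses, independent of POI, you can peek at its first bytes. The sketch below is not POI code; the class name and return values are invented for illustration. It compares the header against the OLE2 signature and the ZIP local-file-header signature, then rewinds the stream, which is the same mark/reset requirement the Javadoc above mentions. A "ZIP" result only says the file is a zip archive. It could be xlsx, docx, or any other zip.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.util.Arrays;

final class OfficeFormatSniffer {

    // Signature at the start of OLE2 compound documents (.xls, .doc, .ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // "PK\3\4": the local file header that starts a zip archive (.xlsx, .docx, ...).
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Returns "OLE2", "ZIP" or "UNKNOWN" and leaves the stream positioned at its start. */
    static String sniff(BufferedInputStream in) throws IOException {
        byte[] header = new byte[8];
        in.mark(header.length);
        int total = 0;
        int n;
        while (total < header.length
                && (n = in.read(header, total, header.length - total)) != -1) {
            total += n;
        }
        in.reset(); // rewind so the caller can still parse the whole stream

        if (total >= OLE2_MAGIC.length && Arrays.equals(header, OLE2_MAGIC)) {
            return "OLE2";
        }
        if (total >= ZIP_MAGIC.length && Arrays.equals(Arrays.copyOf(header, ZIP_MAGIC.length), ZIP_MAGIC)) {
            return "ZIP";
        }
        return "UNKNOWN";
    }
}
```

If this reports "ZIP" for the input file, the file itself is probably fine and the problem lies in how the stream is handed to POI.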

Creating tables in a MS Word file using Java

I want to create a table in a Microsoft Office Word file using Java. Can anybody tell me how to do it with an example?

Have a look at Apache POI
The POI project is the master project for developing pure Java ports of file formats based on Microsoft's OLE 2 Compound Document Format. OLE 2 Compound Document Format is used by Microsoft Office Documents, as well as by programs using MFC property sets to serialize their document objects.

I've never seen it done, and I work in Word a lot. If you really want to programmatically do something in a Word document then I'd advise using Microsoft's scripting language VBA, which is specifically designed for this purpose. In fact, I'm working in it right now.
If you're working under Open Office then they have a very similar set of macro-powered tools for doing the same thing.

Office 2003 has an XML format, and the default document format for Office 2007 is XML (zipped). So you could just generate XML from Java. If you open an existing document, it's not too hard to see the XML required.
Alternatively, you could use openoffice's api to generate a document, and save it as a ms-word document.

This snippet can be used to create a table dynamically in MS Word document.
XWPFDocument document = new XWPFDocument();
XWPFTable tableTwo = document.createTable();
XWPFTableRow tableTwoRowOne = tableTwo.getRow(0);
tableTwoRowOne.getCell(0).setText(Knode1);
tableTwoRowOne.createCell().setText(tags.get("node1").toString());
for (int i = 1; i < nodeList.length; i++) {
String node = "node";
String nodeVal = "";
XWPFTableRow tr = null;
node = node + (i + 1);
nodeVal = tags.get(node).toString();
if (tr == null) {
tr = tableTwo.createRow();
tr.getCell(0).setText(nodeList[i]);
tr.getCell(1).setText(tags.get(node).toString());
}
}

Our feature set is to hit a button in our web app and get the page you are looking at back as a Word document. We use the docx schema for description of documents and have a bunch of Java code on the server side which does the document creation and response back to our web client. The formatting itself is done with some compiled xsl-t's from within Java to translate from our own XML persistence tier.
The docx schema is pretty hard to understand. The way we made most progress was to create template docx's in Word with exactly the formatting that we needed but with bogus content. We then fooled around with them until we understood exactly what was going on. There is a huge amount in the docx that you don't really need to worry about. When reading / translating the docx Word is pretty tolerant to a partially complete formatting schema. In fact we chose to strip out pretty much all the formatting because it also means that the user's default formatting takes precedence, which they seem to prefer. It also makes the xsl process faster and the resulting document smaller.

I manage the docx4j project
docx4j contains a class TblFactory, which creates regular tables (ie no row or column spans), with the default settings which Word 2007 would create, and with the dimensions specified by the user.
If you want a more complex table, the easiest approach is to create it in Word, then copy the resulting XML into a String in your IDE, where you can use docx4j's XmlUtils.unmarshalString to create a Tbl object from it.

Using my little zip utility, you can create a docx with ease, if you know what you're doing. Word's DOCX file format is simply a zip archive (folders with XML files). With the Java zip utilities you can modify an existing docx, replacing just the content part.
For the following sample to work, open Word, enter a few lines, and save the document. Then, with a zip program, remove the file word/document.xml (the file that holds the main content of the Word document) from the archive and save the modified zip. Now you have the template prepared.
Here is what creating a new Word file looks like:
/* docx file head */
final String DOCUMENT_XML_HEAD =
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\" ?>" +
"<w:document xmlns:wpc=\"http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas\" xmlns:mc=\"http://schemas.openxmlformats.org/markup-compatibility/2006\" xmlns:o=\"urn:schemas-microsoft-com:office:office\" xmlns:r=\"http://schemas.openxmlformats.org/officeDocument/2006/relationships\" xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\" xmlns:v=\"urn:schemas-microsoft-com:vml\" xmlns:wp14=\"http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing\" xmlns:wp=\"http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing\" xmlns:w10=\"urn:schemas-microsoft-com:office:word\" xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\" xmlns:w14=\"http://schemas.microsoft.com/office/word/2010/wordml\" xmlns:w15=\"http://schemas.microsoft.com/office/word/2012/wordml\" xmlns:wpg=\"http://schemas.microsoft.com/office/word/2010/wordprocessingGroup\" xmlns:wpi=\"http://schemas.microsoft.com/office/word/2010/wordprocessingInk\" xmlns:wne=\"http://schemas.microsoft.com/office/word/2006/wordml\" xmlns:wps=\"http://schemas.microsoft.com/office/word/2010/wordprocessingShape\" mc:Ignorable=\"w14 w15 wp14\">" +
"<w:body>";
/* docx file foot */
final String DOCUMENT_XML_FOOT =
"</w:body>" +
"</w:document>";
final ZipOutputStream zos = new ZipOutputStream(new FileOutputStream("c:\\TEMP\\test.docx"));
final String fullDocumentXmlContent = DOCUMENT_XML_HEAD + "<w:p><w:r><w:t>Hey MS Word, hello from java.</w:t></w:r></w:p>" + DOCUMENT_XML_FOOT;
final si.gustinmi.DocxZipCreator creator = new si.gustinmi.DocxZipCreator();
// create new docx file
creator.createDocxFromExistingDocx(zos, "c:\\TEMP\\existingDocx.docx", fullDocumentXmlContent);
These are zip utilities:
package si.gustinmi;
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.logging.Logger;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
/**
* Creates new docx from existing one.
* @author gustinmi [at] gmail [dot] com
*/
public class DocxZipCreator {
public static final Logger log = Logger.getLogger(DocxZipCreator.class.getCanonicalName());
private static final int BUFFER_SIZE = 4096;
/** On-the-fly zip creator. Traverses the existing docx zip and creates the new one simultaneously.
* At the end, the custom document.xml is inserted.
* @param zipFilePath location of existing docx template (without word/document.xml)
* @param documentXmlContent content of the word/document.xml
* @throws IOException
*/
public void createDocxFromExistingDocx(ZipOutputStream zos, String zipFilePath, String documentXmlContent) throws IOException {
final FileInputStream fis = new FileInputStream(zipFilePath);
final ZipInputStream zipIn = new ZipInputStream(fis);
try{
log.info("Starting to create new docx zip");
ZipEntry entry = zipIn.getNextEntry();
while (entry != null) { // iterates over entries in the zip file
copyEntryfromZipToZip(zipIn, zos, entry.getName());
zipIn.closeEntry();
entry = zipIn.getNextEntry();
}
// add document.xml to existing zip
addZipEntry(documentXmlContent, zos, "word/document.xml");
}finally{
zipIn.close();
zos.close();
log.info("End of docx creation");
}
}
/** Copies a single entry from zip to zip */
public void copyEntryfromZipToZip(ZipInputStream is, ZipOutputStream zos, String entryName)
{
final byte [] data = new byte[BUFFER_SIZE];
int len;
int lenTotal = 0;
try {
final ZipEntry entry = new ZipEntry(entryName);
zos.putNextEntry(entry);
final CRC32 crc32 = new CRC32();
while ((len = is.read(data)) > -1){
zos.write(data, 0, len);
crc32.update(data, 0, len);
lenTotal += len;
}
entry.setSize(lenTotal);
entry.setTime(System.currentTimeMillis());
entry.setCrc(crc32.getValue());
}
catch (IOException ioe){
ioe.printStackTrace();
}
finally{
try { zos.closeEntry();} catch (IOException e) {}
}
}
/** Create new zip entry with content
* @param content content of a new zip entry
* @param zos target zip stream
* @param entryName name (e.g. word/document.xml)
*/
public void addZipEntry(String content, ZipOutputStream zos, String entryName)
{
final byte [] data = new byte[BUFFER_SIZE];
int len;
int lenTotal = 0;
try {
final InputStream is = new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8));
final ZipEntry entry = new ZipEntry(entryName);
zos.putNextEntry(entry);
final CRC32 crc32 = new CRC32();
while ((len = is.read(data)) > -1){
zos.write(data, 0, len);
crc32.update(data, 0, len);
lenTotal += len;
}
entry.setSize(lenTotal);
entry.setTime(System.currentTimeMillis());
entry.setCrc(crc32.getValue());
}
catch (IOException ioe){
ioe.printStackTrace();
}
finally{
try { zos.closeEntry();} catch (IOException e) {}
}
}
}
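Since the question asks for a table, here is a hedged sketch of how the body content in the sample above could carry one instead of a single paragraph. It builds a minimal WordprocessingML w:tbl fragment as a String, meant to be placed between DOCUMENT_XML_HEAD and DOCUMENT_XML_FOOT. The class and method names are invented for illustration, and the markup is the bare minimum as I understand the schema (table properties, grid, rows, cells with one paragraph each). It has no borders or styling, and I have not checked how every Word version renders it.

```java
final class WordTableXml {

    /** Escapes the characters that are not allowed in XML element text. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Builds a w:tbl fragment with one w:tr per row and one w:tc per cell. */
    static String table(String[][] rows) {
        StringBuilder xml = new StringBuilder("<w:tbl>");
        // Table properties: automatic width.
        xml.append("<w:tblPr><w:tblW w:w=\"0\" w:type=\"auto\"/></w:tblPr>");
        // Grid definition: one column per cell of the first row.
        xml.append("<w:tblGrid>");
        int columns = rows.length == 0 ? 0 : rows[0].length;
        for (int c = 0; c < columns; c++) {
            xml.append("<w:gridCol/>");
        }
        xml.append("</w:tblGrid>");
        for (String[] row : rows) {
            xml.append("<w:tr>");
            for (String cell : row) {
                // Each cell holds a paragraph containing a single text run.
                xml.append("<w:tc><w:p><w:r><w:t>")
                   .append(escape(cell))
                   .append("</w:t></w:r></w:p></w:tc>");
            }
            xml.append("</w:tr>");
        }
        return xml.append("</w:tbl>").toString();
    }
}
```

Usage would be along the lines of DOCUMENT_XML_HEAD + WordTableXml.table(rows) + "<w:p/>" + DOCUMENT_XML_FOOT. The trailing empty paragraph mirrors what Word itself writes after a table.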

Office Writer would be a better tool to use than POI for your requirement.
If all you want is a simple table without much formatting, I would use this simple trick: use Java to generate the table as HTML with plain old table, tr, and td tags, and copy the rendered HTML table into the Word document ;)
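A minimal sketch of that HTML trick, with invented class and method names: it renders a two-dimensional String array as a plain HTML table. You could write the result to an .html file, open it in a browser, and copy the rendered table into Word as described.

```java
final class HtmlTable {

    /** Escapes the characters that would otherwise be read as HTML markup. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Renders the rows as a bordered HTML table using only table, tr and td tags. */
    static String render(String[][] rows) {
        StringBuilder html = new StringBuilder("<table border=\"1\">");
        for (String[] row : rows) {
            html.append("<tr>");
            for (String cell : row) {
                html.append("<td>").append(escape(cell)).append("</td>");
            }
            html.append("</tr>");
        }
        return html.append("</table>").toString();
    }

    public static void main(String[] args) {
        System.out.println(render(new String[][] {{"Name", "Value"}, {"node1", "42"}}));
    }
}
```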

Click here for a Working example with source code.
This example generates MS-Word docs from Java, based on a template concept.
