How to can process xlsx file (from template) in Java?

How to can process xlsx file (from template) in Java? - java

In my application we generate Excel files using JExcel API which allows us to use XLS templates file. Now we must also manage XLSX format, but JExcel can not handle this format. What other API can be used ? I wanted to use POI but it does not take into account the templates. This forces us to change the code to fully recreated a file each time.
Thank's.

The format xlsx is just a zip of some XML files, and some other files maybe.
You could use ZipFile, but a Zip File System could be easiser to operate on single embedded XML files:
Map<String, String> zipProperties = new HashMap<>();
zipProperties.put("encoding", "UTF-8");
try (FileSystem zipFS = FileSystems.newFileSystem(docxUri,
zipProperties)) {
Path mediaPath = zipFS.getPath("/word/media");
...
You can copy/rename/move and so on. Excel is a bit harder, as it uses a shared.xml with shared strings.
This approach allows to keep near to some current Excel variant, which apache POI seems to have difficulty to achieve.

Related

Generate string content for CSV with Apache Commons in Java instead of writing an actual file

in my Java program, I would like to use the Apache Commons library to generate the content of a CSV file, but not actually creating the file.
I only want to have the String content of the file, and later, write the file using that content using an existing method.
However in all examples of code, it is necessary to define in the target csv file before hand, giving its path and name, but I don't have it, at this moment of the program flow.
Is it possible to just get the String for the future csv file, and handle the real file creation independently?
Thank you.

StringWriter
You need the file when you want to write to the file.
You could pass the file's content around your programm or store it somewhere until you want to write to the file.
However if you want to write your CSV to String and not to a File you could try to use a StringWriter instead of a FileWriter.
Something like (not compiled, might not be complete)
StringWriter stringWriter = new StringWriter();
CSVPrinter csvPrinter = new CSVPrinter(stringWriter , CSVFormat.DEFAULT
.withHeader("Header1", "Header2"));
csvPrinter.printRecord("abc", "ghf");
String csvString = stringWriter.toString();

Error when parsing an embedded .xlsx file from a .ppt using apache-poi. The supplied POIFSFileSystem does not contain a BIFF8 'Workbook' entry

I am facing an issue when using apache poi to extract an embedded .xlsx files from a .ppt file. It would be really great if somebody could help me out.
The subject of the problem:
Problem trying to solve: Extracting a ".xlsx" file embedded inside a ".ppt".
I am currently using apache-poi.
It seems that when I try to do it using hslfSlideShow.getEmbeddedObjects(), I get the xlsx object just fine but when I try converting it to the XLSFWorkbook object using say WorkbookFactory.create(inputStream), it threw an error saying
java.lang.IllegalArgumentException: The supplied POIFSFileSystem does not contain a BIFF8 'Workbook' entry. Is it really an excel file? Had: [OlePres000, Ole, CompObj, Package]
at org.apache.poi.hssf.usermodel.HSSFWorkbook.getWorkbookDirEntryName(HSSFWorkbook.java:286)
at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:326)
at org.apache.poi.hssf.usermodel.HSSFWorkbookFactory.createWorkbook(HSSFWorkbookFactory.java:64)
at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:167)
at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:112)
at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:253)
at org.apache.poi.ss.usermodel.WorkbookFactory.create(WorkbookFactory.java:221)
Interestingly it is calling HSSFWorkbookFactory even though its an xlsx file.
And no the xlsx file is not corrupted/password-protected. I can open it just fine.
Also, it works fine if I try parsing the .xlsx file without embedding it in the .ppt.
And the parsing works fine when I embed it in a .pptx file and call methods such as xmlSlideShow.getAllEmbeddedParts() to get the embedded objects from .pptx.

Promoting some comments and investigation to an answer...
This was a limitation in older version of Apache POI, but was fixed in July in r1880164.
For backwards-compatibility reasons, PowerPoint will often (but not always...) write embedded OOXML resources wrapped in an intermediate OLE2 layer. This has the advantage that tools/programs which expect embedded office documents to be something like a xls / doc to cope, but at the expense of another layer of wrapping.
Newer versions of Apache POI (5.0 should be the first released one with the fix in) have support in WorkbookFactory for receiving an OLE2 wrapper like this, pulling out the underlying xlsx stream and handing that off to XSSFWorkbook. (Older versions did this for OLE2-based password-protected xlsx files, but not their unencrypted cousins)
For now, if you're stuck on an affected POI version, the code you'll want is something like this (largely taken from the unit test verifying support!):
POIFSFileSystem fs = new POIFSFileSystem(data.getInputStream());
if(fs.getRoot().hasEntry("Package")) {
DocumentInputStream dis = new DocumentInputStream((DocumentEntry)fs.getRoot().getEntry("Package"));
try (OPCPackage pkg = OPCPackage.open(dis)) {
XSSFWorkbook wb = new XSSFWorkbook(pkg);
handleWorkbook(wb);
wb.close();
}
} else {
try (HSSFWorkbook wb = new HSSFWorkbook(fs)) {
handleWorkbook(wb);
}
}

Write datasets to combined xls file

I have got a Scala script which writes a great deal of csv files with e.g. file names: "dog-species", "dog-weight", "cat-species", "cat-params" and so one. Would like to change the behaviour of the script to combine a datasets into bigger xls files with additional "info" sheet, which will contains some introductory details. Expected result:
file "dog.xls" with sheets: "info", "species", "weight", ...
file "cat.xls" with sheets: "info", "species", "params", ...
So my approach would be to use a conditional logic: when I proceed "dog-species" dataset, i check if the file "dog.xls" exists. If file exists I append the file "dog.xls" with new "species" sheets. If file doesn't exists I create a new "dog.xls" file with the "info" sheet and after that append with the "species" sheets.
Any idea about the possible Scala/Java libraries or ways to deal with the issue? I use Scala 2.10.5, Java 1.8, Spark 1.6.3.
Thanks.

In Spark you can write CSV-s but not XLS files.
I suggest that try to write CSV-s than merge them by your logic with https://poi.apache.org/
If you don't have huge datasets which I think it is the case (you don't store huge dataset in XLS) than you can just collect them and write the XLS.

how to know whether a file is .docx or .doc format from Apache POI

I know we can get it done by extension or by mime type, do we have any other way through which we can get the idea of type of file whether it is .docx or .doc.

If it is just a matter of decided whether a collection of files known to either be .doc or .docx but are not marked accordingly with an extension, you can use the fact that a .docx file is a zipped collection of files. Something to the tune as follows might help:
boolean isZip = new ZipInputStream( fileStream ).getNextEntry() != null;
where fileStream is whatever file or other input stream you wish to evaluate. You could further evaluate a zipped file by looking for key .docx entries. A good starting reference is Word Document (DOCX). Likewise, if you know it is just a binary file, you can test for Word's File Information Block (see Word (.doc) Binary File Format)

You could use Apache Tika for content Detection. But you should been aware that this is a huge framework (many required dependencies) for such a small task.

There is a way, no strightforward though. But with Apache POI, you can locate it.
Try to read a .docx file using HWPFDocument Class. It would give you the following error
org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied
data appears to be in the Office 2007+ XML. You are calling the part
of POI that deals with OLE2 Office Documents. You need to call a
different part of POI to process this data (eg XSSF instead of HSSF)
String filePath = "C:\\XXXX\XXXX.docx";
FileInputStream inStream;
try {
inStream = new FileInputStream(new File(filePath));
HWPFDocument doc = new HWPFDocument(inStream);
WordExtractor wordExtractor = new WordExtractor(doc);
System.out.println("Getting words"+wordExtractor.getText());
} catch (Exception e) {
System.out.print("Its not a .doc format");
}
.docx can be read using XWPFDocument Class.

Why dont you use Apache Tika:
File file = new File('File Here');
Tika tika = new Tika();
String filetype = tika.detect(file);
System.out.println(filetype);

Assuming you're using Apache POI, you have a few options.
One is to grab the first few bytes of the file, and ask POIFSFileSystem with the hasPOIFSHeader(byte) method. If you have a stream that supports mark/reset, you can instead use POIFSFileSystem.hasPOIFSHeader(InputStream). If those return true then try to open it as a .doc with HWPF, otherwise try as .docx with XWPF
Otherwise, if you prefer a try/catch way, try to open it with POIFSFileSystem and catch OfficeXmlFileException - if it opens fine it's .doc, if you get the exception it's .docx
If you look at the source code for WorkbookFactory you'll see the first pattern in use, you can copy a similar set of logic form that

Apache POI HSSF XLS reading error

Using the following code while reading in a .xls file, where s is the file directory:
InputStream input = new FileInputStream(s);
Workbook wbs = new HSSFWorkbook(input);
I get the following error message:
Exception in thread "main" java.io.IOException: Invalid header signature; read 0x0010000000060809, expected 0xE11AB1A1E011CFD0
I need a program that is able to read in either XLSX or XLS, and using the exact same code just adjusted for XSSF it has no problem at all reading in the XLSX file.

The Exception you're getting is one telling you that the file you're supplying isn't a valid Excel binary file, at least not a valid Excel file produced since about 1990. The exception you're getting tells you what POI expects, and that it found something else instead which wasn't a valid .xls file, and wasn't anything else POI can detect.
One thing to be aware of is that Excel opens a wide variety of different file formats, including .csv and .html. It's also not very picky about the file extension, so will happily open a CSV file that has been renamed to a .xls one. However, since renaming a .csv to a .xls doesn't magically change the format, POI still can't open it!
.
From the exception, I can tell what's happening, and I can also tell you're using an ancient version of Apache POI! A header signature of 0x0010000000060809 corresponds to the Excel 4 file format, from about 25 years ago! If you use a more recent version of Apache POI, it'll give you a helpful error message telling you that the file supplied is an old and largely unsupported Excel file. New versions of POI do include the OldExcelExtractor tool which can pull out some information from those ancient formats.
Otherwise, as with all exceptions of this type, try opening the file in Excel and doing a save-as. That will give you an idea of what the file currently is (eg .html saved as .xls, .csv saved as .xls etc), and will also let you re-save it as a proper .xls file for POI to load and work with.

If the file is in xlsx format instead of xls you might get this error. I would try using the generic Workbook object (Also called the SS Usermodel)
Check out the Workbook interface and the WorkbookFactory object. The factory should be able to create a generic Workbook for you out of either xlsx or xls.
I thought I had a good tutorial on this, but I can't seem to find it. I'll keep looking though.
Edit
I found this little tiny snippet from Apache's site about reading and rewriting using the SS Usermodel.
I hope this helps!

Invalid header signature; read 0x342E312D46445025, expected 0xE11AB1A1E011CFD0
Well I got this error when I uploaded corrupted xls/xlsx file(to upload corrupt file I renamed sample.pdf to sample.xls). Add validation like :
Workbook wbs = null;
try {
InputStream input = new FileInputStream(s);
wbs = new HSSFWorkbook(input);
} catch(IOException e) {
// log "file is corrupted", show error message to user
}

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.