I have a Scala script that writes a large number of CSV files with names such as "dog-species", "dog-weight", "cat-species", "cat-params" and so on. I would like to change the script so that it combines the datasets into larger XLS files, each with an additional "info" sheet containing some introductory details. Expected result:
file "dog.xls" with sheets: "info", "species", "weight", ...
file "cat.xls" with sheets: "info", "species", "params", ...
My approach would be conditional logic: when I process the "dog-species" dataset, I check whether the file "dog.xls" exists. If it exists, I append a new "species" sheet to "dog.xls". If it doesn't exist, I create a new "dog.xls" with the "info" sheet and then append the "species" sheet.
Any ideas about Scala/Java libraries or other ways to handle this? I use Scala 2.10.5, Java 1.8, Spark 1.6.3.
Thanks.
In Spark you can write CSV files but not XLS files.
I suggest writing the CSV files and then merging them according to your logic with Apache POI (https://poi.apache.org/).
If your datasets are not huge, which I assume is the case (you wouldn't store a huge dataset in XLS), then you can simply collect them and write the XLS directly.
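The grouping step of such a merge can be sketched in plain Java before any Excel library is involved. This is only a sketch under the assumption that every dataset name has the form "<prefix>-<sheet>", as in the question; the class and method names (SheetPlan, plan) are made up for illustration, and writing the actual workbook would still be left to a library such as POI.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SheetPlan {
    /**
     * Groups "<prefix>-<sheet>" dataset names by prefix.
     * Every planned workbook starts with an "info" sheet, mirroring the
     * "create the file with info first, then append" logic from the question.
     */
    static Map<String, List<String>> plan(List<String> datasetNames) {
        Map<String, List<String>> sheetsByWorkbook = new LinkedHashMap<>();
        for (String name : datasetNames) {
            int dash = name.indexOf('-');
            if (dash <= 0 || dash == name.length() - 1) {
                throw new IllegalArgumentException("Expected <prefix>-<sheet>, got: " + name);
            }
            String workbook = name.substring(0, dash) + ".xls";
            String sheet = name.substring(dash + 1);
            List<String> sheets = sheetsByWorkbook.get(workbook);
            if (sheets == null) {
                // First dataset for this prefix: the workbook is "new", so add "info" first.
                sheets = new ArrayList<>();
                sheets.add("info");
                sheetsByWorkbook.put(workbook, sheets);
            }
            sheets.add(sheet);
        }
        return sheetsByWorkbook;
    }

    public static void main(String[] args) {
        System.out.println(plan(Arrays.asList(
                "dog-species", "dog-weight", "cat-species", "cat-params")));
        // prints {dog.xls=[info, species, weight], cat.xls=[info, species, params]}
    }
}
```

Planning all sheets of a workbook up front means each file can be written once, instead of being reopened and appended to for every dataset.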
I have the issue that Apache POI "corrupts" an xlsm / xlsx file just by reading and writing it, e.g. with the following code:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class Snippet {
    public static void main(String[] args) throws Exception {
        String str1 = "c:/tmp/spreadsheet.xlsm";
        String str2 = "c:/tmp/spreadsheet_poi.xlsm";
        // open file
        XSSFWorkbook wb = new XSSFWorkbook(new FileInputStream(new File(str1)));
        // save file
        FileOutputStream out = new FileOutputStream(str2);
        wb.write(out);
        wb.close();
        out.close();
    }
}
Once you open the spreadsheet_poi.xlsm in Excel you'll get an error like the following
"We found a problem with some content in xxx. Do you want us to try to recover as much as we can..."?
If you say yes you'll end up with a log which could look like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<recoveryLog xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
<logFileName>error145040_01.xml</logFileName>
<summary>Errors were detected in file 'C:\tmp\spreadsheet_poi.xlsm'</summary>
<repairedParts>
<repairedPart>Repaired Part: /xl/worksheets/sheet4.xml part with XML error. Load error. Line 2, column 0.</repairedPart>
<repairedPart>Repaired Part: /xl/worksheets/sheet5.xml part with XML error. Load error. Line 2, column 0.</repairedPart>
<repairedPart>Repaired Part: /xl/worksheets/sheet8.xml part with XML error. Load error. Line 2, column 0.</repairedPart>
</repairedParts>
</recoveryLog>
What's the best approach to debug the issue in more detail (e.g. to find out what makes POI "corrupt" the file)?
Eventually I found that the best approach for debugging this comes down to two things:
Open the affected workbook (e.g. with 7-Zip) and format the affected sheets with an XML editor (e.g. Notepad++ > Plugins > XML Tools > Pretty print (XML only - with line breaks)). After saving the files and updating the xlsm file you'll get the "real" line numbers in the Excel error log. An alternative (which I haven't tried, but which should work according to the POI mailing list): use OOXMLPrettyPrint (https://svn.apache.org/repos/asf/poi/trunk/src/ooxml/java/org/apache/poi/ooxml/dev/) to format the file and then reopen it in Excel.
If the real line numbers don't already help, compare the sheet XML files of the original xlsx file with those saved by POI. You'll notice differences in the attributes, and their order is different too. To compare them properly I used Beyond Compare with "Additional File Formats" (see https://weblogs.asp.net/lorenh/comparing-xml-files-with-beyond-compare-3-brilliant for more information). There may be other diff tools that are equally good.
In my case the problem was that poi somehow changed the dimension setting from
<dimension ref="A1:XFD147"/>
to
<dimension ref="A1:XFE147"/>
(with XFE being a non-existent column). I fixed it by removing the many empty columns in the original xlsx file.
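Why XFE is invalid can be checked with a little arithmetic: column letters are a base-26 numbering without a zero digit, and XFD is column 16384, the last column an xlsx sheet allows. The following is a small sketch of that conversion; the class and method names are hypothetical and not part of POI.

```java
class ColumnRef {
    /** Last valid column in an xlsx sheet: XFD = 16384. */
    static final int MAX_XLSX_COLUMNS = 16384;

    /** Converts column letters such as "A", "Z", "AA" or "XFD" to a 1-based column number. */
    static int columnNumber(String letters) {
        int n = 0;
        for (char c : letters.toUpperCase().toCharArray()) {
            if (c < 'A' || c > 'Z') {
                throw new IllegalArgumentException("Not a column reference: " + letters);
            }
            n = n * 26 + (c - 'A' + 1);
        }
        return n;
    }

    /** True if the letters name a column that exists in an xlsx sheet. */
    static boolean isValidXlsxColumn(String letters) {
        int n = columnNumber(letters);
        return n >= 1 && n <= MAX_XLSX_COLUMNS;
    }

    public static void main(String[] args) {
        System.out.println(columnNumber("XFD"));      // prints 16384
        System.out.println(columnNumber("XFE"));      // prints 16385
        System.out.println(isValidXlsxColumn("XFE")); // prints false
    }
}
```

A check like this could be run over the dimension ref of each sheet XML to spot the problem without opening Excel.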
My professor said: "How does the mathematician find the lion in the desert?" - "First cuts the desert into two halves, finds out where is the lion, then repeats it until the lion is caught".
So, try to remove features from the Excel files, try different versions, until you find the root cause. There may be multiple causes, though.
We have generated a .csv file using the OpenCSV library in Java. Our requirement is to change the extension from .csv to .xls.
When we blindly changed the extension by renaming the file to .xls in Java code, the data was no longer aligned or formatted properly.
When we open the .csv file in Excel, the values in the table are aligned properly. But when we change the extension to .xls and open it, all the comma-separated values end up in a single column, i.e. the values are not placed under their respective columns.
So why not open the .csv file in Excel, do a "Save As", and select Excel spreadsheet as the file type?
That is the part you are missing. Changing the extension does not change the file type; it only changes how most computers treat the file. Open a real Excel spreadsheet in a text editor and I assure you that you will see a lot more than comma-separated values.
You should look into VBS scripts. I'm doing the opposite (xlsx to csv) using one such script that I found here, so I guess the reverse should be possible too. I hope you find your solution there!
Here is a script to convert an xlsx to a csv:
if WScript.Arguments.Count < 2 Then
WScript.Echo "Error! Please specify the source path and the destination. Usage: XlsToCsv SourcePath.xls Destination.csv"
Wscript.Quit
End If
Dim oExcel
Set oExcel = CreateObject("Excel.Application")
Dim oBook
Set oBook = oExcel.Workbooks.Open(Wscript.Arguments.Item(0))
oBook.SaveAs WScript.Arguments.Item(1), 6
oBook.Close False
oExcel.Quit
I think you need to use "Apache POI - the Java API" for .xls
I would like to create a copy of a Word or Excel file using POI.
I know that POI can also read Word and Excel files. By reading I mean not only the values but also attributes such as font size, table color, and the background color of each cell. Having read the values and attributes of the xlsx or docx document, I want to create an exact copy of it. Is there open-source example code for this available anywhere?
Look at Apache POI or docx4j for dealing with docx documents.
Techniques for adding text to a document are described at https://www.slideshare.net/plutext/document-generation-2012osdcsydney
Use POI's HWPF support, which is often included with docx4j as a dependency. However, it's not an excellent approach, since it does not convert the doc to docx4j's internal representation: you are kind of stuck in HWPF land.
Use JODConverter to convert the doc to a docx and, if necessary, back again. This is often the simplest option.
To open an Excel workbook from one file and save it to another, I use this code:
//open source excel
InputStream template = new FileInputStream("C:\\source excel path\\input.xlsx");
Workbook wb = WorkbookFactory.create(template);
//Saving excel to a different location or filename.
FileOutputStream out = new FileOutputStream("C:\\path to copy excel to\\output.xlsx");
wb.write(out);
wb.close();
out.close();
template.close();
In my application we generate Excel files using the JExcel API, which lets us use XLS template files. Now we must also support the XLSX format, but JExcel cannot handle it. What other API can be used? I wanted to use POI, but it does not support templates, which would force us to change the code to fully recreate the file each time.
Thanks.
The xlsx format is just a zip archive of XML files, plus possibly some other files.
You could use ZipFile, but a zip file system may be easier for operating on single embedded XML files:
Map<String, String> zipProperties = new HashMap<>();
zipProperties.put("encoding", "UTF-8");
try (FileSystem zipFS = FileSystems.newFileSystem(docxUri,
zipProperties)) {
Path mediaPath = zipFS.getPath("/word/media");
...
You can copy, rename, move and so on. Excel is a bit harder, as it keeps shared strings in a separate part (xl/sharedStrings.xml).
This approach lets you stay close to the current Excel format variant, which Apache POI seems to have difficulty achieving.
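A self-contained sketch of the zip file system idea, using only the JDK: it writes one entry into a zip-based file and reads it back. The demo file is a stand-in created in a temporary directory and is not a valid workbook; the class and method names (XlsxZipDemo, writeEntry, readEntry) are made up for this illustration.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

class XlsxZipDemo {
    /** Writes (or overwrites) a single entry; creates the zip file if it does not exist yet. */
    static void writeEntry(Path zipFile, String entry, String content) throws Exception {
        Map<String, String> env = new HashMap<>();
        env.put("create", "true");
        try (FileSystem zipFs = FileSystems.newFileSystem(URI.create("jar:" + zipFile.toUri()), env)) {
            Path target = zipFs.getPath(entry);
            if (target.getParent() != null) {
                Files.createDirectories(target.getParent());
            }
            Files.write(target, content.getBytes(StandardCharsets.UTF_8));
        }
    }

    /** Reads one entry of a zip-based file (such as an .xlsx) as UTF-8 text. */
    static String readEntry(Path zipFile, String entry) throws Exception {
        Map<String, String> env = new HashMap<>();
        try (FileSystem zipFs = FileSystems.newFileSystem(URI.create("jar:" + zipFile.toUri()), env)) {
            return new String(Files.readAllBytes(zipFs.getPath(entry)), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in archive for the demo only; a real template would already exist on disk.
        Path demo = Files.createTempDirectory("zipfs-demo").resolve("demo.xlsx");
        writeEntry(demo, "/xl/sharedStrings.xml", "<sst/>");
        System.out.println(readEntry(demo, "/xl/sharedStrings.xml"));
    }
}
```

For a template workflow one would copy the template file first and then rewrite only the parts that change; any cell text that lives in the shared strings part has to be kept consistent with the sheets that reference it.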
Using the following code to read in a .xls file, where s is the file path:
InputStream input = new FileInputStream(s);
Workbook wbs = new HSSFWorkbook(input);
I get the following error message:
Exception in thread "main" java.io.IOException: Invalid header signature; read 0x0010000000060809, expected 0xE11AB1A1E011CFD0
I need a program that can read either XLSX or XLS. Using the exact same code adjusted for XSSF, it has no problem at all reading the XLSX file.
The exception is telling you that the file you're supplying isn't a valid Excel binary file, at least not one produced since about 1990. It tells you what POI expected, and that it found something else instead which wasn't a valid .xls file and wasn't anything else POI can detect.
One thing to be aware of is that Excel opens a wide variety of different file formats, including .csv and .html. It's also not very picky about the file extension, so it will happily open a CSV file that has been renamed to .xls. However, since renaming a .csv to .xls doesn't magically change the format, POI still can't open it!
From the exception, I can tell what's happening, and I can also tell you're using an ancient version of Apache POI! A header signature of 0x0010000000060809 corresponds to the Excel 4 file format, from about 25 years ago! If you use a more recent version of Apache POI, it'll give you a helpful error message telling you that the file supplied is an old and largely unsupported Excel file. New versions of POI do include the OldExcelExtractor tool which can pull out some information from those ancient formats.
Otherwise, as with all exceptions of this type, try opening the file in Excel and doing a save-as. That will give you an idea of what the file currently is (eg .html saved as .xls, .csv saved as .xls etc), and will also let you re-save it as a proper .xls file for POI to load and work with.
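As a quick check before handing a file to POI, the first bytes can be compared against the two container signatures involved: the OLE2 header that binary .xls files start with (the bytes behind the expected value 0xE11AB1A1E011CFD0 in the exception) and the ZIP header that .xlsx files start with. This is a rough sketch with hypothetical names; a match only means the container type is plausible, since other OLE2 documents and other zip archives share the same signatures.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class ExcelSniffer {
    enum Kind { OLE2, ZIP, OTHER }

    // OLE2 compound document header, as used by binary .xls files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
    // Local file header of a zip archive, as used by .xlsx files.
    private static final byte[] ZIP_MAGIC = {'P', 'K', 3, 4};

    /** Classifies a file by the bytes it starts with. */
    static Kind sniff(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) return Kind.OLE2;
        if (startsWith(head, ZIP_MAGIC)) return Kind.ZIP;
        return Kind.OTHER; // e.g. CSV or HTML text, or some other format
    }

    /** Reads up to 8 bytes from the file and classifies them. */
    static Kind sniff(Path file) throws IOException {
        byte[] head = new byte[8];
        int total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while (total < head.length
                    && (n = in.read(head, total, head.length - total)) > 0) {
                total += n;
            }
        }
        return sniff(Arrays.copyOf(head, total));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

A CSV file renamed to .xls would come back as OTHER here, which matches the situation described above.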
If the file is in xlsx format instead of xls you might get this error. I would try using the generic Workbook object (Also called the SS Usermodel)
Check out the Workbook interface and the WorkbookFactory object. The factory should be able to create a generic Workbook for you out of either xlsx or xls.
I thought I had a good tutorial on this, but I can't seem to find it. I'll keep looking though.
Edit
I found this little tiny snippet from Apache's site about reading and rewriting using the SS Usermodel.
I hope this helps!
Invalid header signature; read 0x342E312D46445025, expected 0xE11AB1A1E011CFD0
I got this error when I uploaded a corrupted xls/xlsx file (to produce a corrupt file I renamed sample.pdf to sample.xls). Add validation like this:
Workbook wbs = null;
try (InputStream input = new FileInputStream(s)) {
    wbs = new HSSFWorkbook(input);
} catch (IOException e) {
    // log "file is corrupted", show error message to user
}
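The signature in that error message can itself reveal what the file really is. POI reports the first 8 bytes of the file as a little-endian long, so printing them back in file order shows the original header. A small sketch (the class name is hypothetical):

```java
class SignatureDecoder {
    /**
     * Renders the 8 header bytes of a "read 0x..." value in file order,
     * replacing non-printable bytes with '.'.
     */
    static String asAscii(long signature) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 8; i++) {
            // The lowest byte of the long is the first byte of the file.
            int b = (int) ((signature >>> (8 * i)) & 0xFF);
            sb.append(b >= 0x20 && b < 0x7F ? (char) b : '.');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(asAscii(0x342E312D46445025L)); // prints %PDF-1.4
    }
}
```

Here the bytes spell out a PDF header, consistent with sample.pdf having been renamed to sample.xls.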