I have to write a very large XLS file, I have tried Apache POI but it simply takes up too much memory for me to use.
I had a quick look through StackOverflow and I noticed some references to the Cocoon project and, specifically the HSSFSerializer. It seems that this is a more memory-efficient way to write XLS files to disk (from what I've read, please correct me if I'm wrong!).
I'm interested in the use case described here: http://cocoon.apache.org/2.1/userdocs/xls-serializer.html . I've already written the code to write out the file in the Gnumeric format, but I can't seem to find how to invoke the HSSFSerializer to convert it to XLS.
On further reading it seems like the Cocoon project is a web framework of sorts. I may very well be barking up the wrong tree, but:
Could you provide an example of reading in a file, running the HSSFSerializer on it and writing that output to another file? It's not clear how to do so from the documentation.
My friend, HSSF serializer is part of POI. You are just setting certain attributes in the xml to be serialized (but you need a whole process to create it). Also, setting a whole pipeline using this framework just to create a XLS seems odd as it changes the app's architecture. ¿Is that your decision?
From the docs:
An alternate way of generating a spreadsheet is via the Cocoon
serializer (yet you'll still be using HSSF indirectly). With Cocoon
you can serialize any XML datasource (which might be a ESQL page
outputting in SQL for instance) by simply applying the stylesheet and
designating the serializer.
If memory is an issue, try XSSF or SXSSF in POI.
I don't know if by "XLS" you mean a specific, prior to Office 2007, version of this "Horrible SpreadSheet Format" (which is what HSSF stands for), or just anything you can open with a recent version of MS Office, OpenOffice, ...
So depending on your client requirements (i.e. those that will open your Excel file), another option might be available : generating a .XLSX file.
It comes down to producing an XML file in the proper grammar, which seems to be fit to your situation, as you seem to have already done that with the Gnumeric XML-based file format without technical trouble, and without hitting memory-effisciency issues.
Please note other XML-based spreadsheet formats exist, that Excel and other clients would be able to use. You might want to dig into the open document file formats.
As to wether to use Apache Cocoon or something else:
Cocoon can sure host the XSL processing ; batch (Cocoon CLI) processing is available if you require Cocoon, but require it not to run as a webapp (though as far as I remember, CLI feature was broken in the lastest builds of the 2.1 series) ; and Cocoon comes with a load of features and technologies that could address further requirements.
Cocoon might be overkill if it just comes down to running an XSL transformation, for which there is a bunch of well-known, lighter tools you can pick from.
Related
I'm looking for some info on how to create a .vsdx file in Java without any commercial libraries. According to other questions it seems to be pretty tough.
As a source we have a different, probably unknown file format called .epml that contains graphical information of EPCs which we should be able to convert to a .xml file. As far as I understand the .vsdx format so far, that's one of many files in the unzipped .vsdx required. I'd be glad if anyone could tell me about my options how to implement/create all the other files.
EDIT: The goal here is to be able to convert the graphic information of the .epml file so Visio is able to read & display it as in the source. Therefore, it doesn't have to be a .vsdx file if there are other possible options.
Thanks!
EPML is a not an unknown format, it is an interchange format for EPC tools. Just try to google it :)
I would suggest you convert your .epml files to .svg (there are free open source converters available, like epml2svg). Visio can read and show .svg files. Means - writing code does not seem to be required to achieve your goal (to convert .epml files to something Visio can show). AFAR there is online version of the tool as well - you upload EPML file, get back SVG, and just open it in Visio - that's it.
Side note - there are companies, like bpm-x for example, specializing in BPM tool-to-tool diagram conversion. Maybe they already have a solution for your original tool.
The .VSDX file is "office xml" format, that is also open and documented. But it's pretty tough to generate file from scratch, you are right. So in principle you could start with any code that is capable of handling open xml packages. Microsoft has OpenXML SDK, but that's .NET (MSDN HOWTO assumes you are using .NET, but explains basics of what the open xml package consists of)
AFAIK, for java, there are no open source visio libraries you could use. Java and Visio seem to live in parallel universes. The only viable commercial option I've heard of seem to be Aspose.
Interesting - whilst I cannot give a final answer, here are some thoughts:
Question 1: Why would you want to avoid commercial tools, when the final result file will require some - namely "Visio"?
1) Creating Visio files from XML:
Create template XMLs from a VSDX. Identify the files, that you need to edit. From what I've seen, these should be the masters and the pages files. Being able to make an XML from EPML, you should also know how to adapt it to a new structure.
This solution is probably by far the most tedious and less flexible.
2) Use Visio automation:
Presuming that the final document will need more than just graphics, namely shape data as well, an easier solution would consist of creating the graphics first
a) as SVG and import into Visio
b) even easier - automated drawing by Visio's automation capabilities (VBA, .Net, ...). The shapes to drop would already have been prepared as masters will all the relevant data and behaviour settings.
Then you would populate the data by means of one of the many data linking features (Wizard, Standard data linking, ODBC connections, etc.)
I had requirement to extract specific colums/rows from Excel/CSV file. Somebody suggest me to using Tika for this task.
While going thru tika, I came across POI API and found more friendly to use it.
we may have requirement to parse PDF file in further.
I am new to this technology, i would like know difference between two and which technology is more suitable for my requirement.
Thanks,
Krishna
Apache Tika provides a common way to extract consistent text and metadata from a wide range of formats. It also provides content detection, language detection and a few other bits. If you write your code to work with Apache Tika, then your code will be able to work with a huge range of formats in the same way. You don't need to worry about whether one format has a Title, or another calls the same logical thing a LongTitle or a Subject. You don't need to worry about what library to use for what format. You call Tika, it does the hard work for you, and back comes your consistent Metadata and Textual Content
Apache POI is one of the libraries that Tika uses. POI supports most of the main Microsoft Office formats, including Excel (.xls and .xlsx). It provides access to the whole of the file format, allowing you complete control over what information you read out. (It also supports writing). Tika uses POI to get text and metadata out of the various different Microsoft formats, but doesn't extract everything. Using POI directly would allow you to decide what you care about and get that.
If you want to support lots of file formats, use Tika. If you want full control of how you get the information out, use POI.
Apache POI is full blown parser/writer for most of the Microsoft Documents. It supports both newly introduced 2007 (XSSF) format and Microsoft 2003 file formats (HSSF). Apache POI provides two level of API for parsing and generating Microsoft files. One that is higher level API that is bit memory intensive which reads the whole file and keeps in the memory something similar to DOM parsing in XML and lower level API for memory intensive use which is similar to SAX/StAX parsing.
On the other hand Apache Tika is content analysis tool which I guess only supports Microsoft Excel and lot of other extraction components. There is no support for writing new files or generating content from Tika, anyway that is not the their use case at all.
So, you have to choose depending on your need.
I have an Excel workbook with multiple sheets. Each sheet holds a table, the different tables have different formats.
I need to read the entire workbook into my Java program. The most convenient method IMHO is to export the entire data into a single XML and parse it (using simpleXML or some other compatible parser).
I have found no method for applying a schema to multiple sheets of a workbook, only to a single sheet. Is it possible? If so, how?
When it comes to convenience, there are many factors that influence or define it. For example, it depends if this is an ongoing thing, or if it needs to be integrated into a process, etc.
Before recommending a solution as suggested, I would try to convince you to take a look at Apache's POI (the Java API for Microsoft Documents), specifically the Excel API. It gives you a Java API for your Java program that should allow you to read what you need pretty easily. It would be a one stop shop kind of thing.
Another approach might be to use Jdbc to Odbc and access the Excel via JDBC API (JDBC to ODBC provider). I can't tell from details in your question if your deployment model would allow for this (e.g. if you run on a platform that doesn't have an ODBC provider for Excel files), but on Windows for sure is an option; also, many places on internet detailing this approach.
If you insist on going down the XML export way, QTAssistant (I am associated with it) has a comprehensive solution (XML Builder) for generating XML from any supported relational data source. It provides a GUI and a command line. In your case it would need the XLS, an XSD which describes the XML you want to get out and a mapping file (basically another XML file) to create the XML you need. In general this feature is largely used to convert test data into XML for Web service calls, so it is geared towards a certain interaction pattern between the user, the tool, and the XML generation activities. If you're interested in more details, let me know.
I am able to create .mpx file by using mpxj library in java.
I need write ( create ) .mpp file in java can any one suggest me please.
I maintain MPXJ, and the short answer to your enquiry is that, at present, MPXJ does not write MPP files.
The main reason for this is simply that despite the effort which has gone into understanding the MPP file structure, there is still a great deal of it which is not well understood, hence it is difficult to reliably generate. The other issue is that even if I was to produce some code which could generate an MPP file, the features it could write to that file are likely to lag behind what MPXJ supports in the MSPDI file format, again due to my incomplete understanding of the MPP format.
My suspicion is that the next version of MS project (project 15? Project 2013?) may probably offer a ".mppx" file format, similar to the ".docx" etc formats used by other applications in the MS Office suite. This will be XML-based and will be more straightforward to generate than the binary MPP file format currently is... let's see what Microsoft come up with!
Jon
Visit http://www.mpxj.org/faq/
Can I use MPXJ to write MPP files?
Not at present. Although it is technically feasible to generate an MPP file, the knowledge we have of the file structure is still relatively incomplete, despite the amount of data we are able to correctly extract. It is therefore likely to take a considerable amount of development effort to make this work, and it is conceivable that we will not be ablet to write the full set of attributes that MPXJ supports back into the MPP file - simply because we don't understand the format well enough. You are therefore probably better off using MSPDI which does support the full range of data items present in an MPP file.
You can
Try this: http://www.aspose.com/java/project-management-component.aspx
It writes MPP and Microsoft Project XML.
But this not free
Try this: http://www.aspose.com/java/project-management-component.aspx
It writes MPP and Microsoft Project XML.
I think by "mpp" you probably mean "Microsoft PowerPoint", correct?
Q: Why do you think MPXJ (Microsoft Project Exchange/Java) can't do this?
http://www.mpxj.org/
Welcome to MPXJ! This library provides a set of facilities to allow
project information to be manipulated in Java and .Net. MPXJ supports
a range of data formats: Microsoft Project Exchange (MPX), Microsoft
Project (MPP,MPT), Microsoft Project Data Interchange (MSPDI XML),
Microsoft Project Database (MPD), Planner (XML), Primavera (PM XML,
XER, and database), and Asta Powerproject (PP, MDB).
Hi I'm looking to parse spreadsheets (xls/ods) in Groovy. I have been using the Roo library for Ruby and was looking to try the same tasks in Groovy, as Java is already installed on a development server I use, and I would like to keep the number of technologies on the server to a simple core few.
I am aware that the ods format is zipped XML, and so can be parsed as such, but I would like to process the file using spreadsheet concepts, not XML concepts.
The ability to process xls files is not of major importance, but would save me having to save multiple xls files to ods (as this is for parsing data from clients).
Thanks
I would suggest Apache POI for access to .xls files.
I've never had to work with the .ods format, so no information on that one.
There's also JExcelAPI, which has a nice, clean, simple interface (for the most part).
Can't help you with ODS Files though.
How about looking at 'odftoolkit' ? http://odftoolkit.openoffice.org/
Groovy in Action has a chapter named "Groovy on Windows" that discusses using Scriptom, a Groovy/COM bridge (using JACOB under the covers), to access several Windows apps including Excel.
For OpenOffice, you can use ODF Toolkit, as Amit pointed out.
I second jdmichal's vote for Apache POI. I have selected it as our library of choose to handle Excel file input (.XLS). The project is also working on the .XLSX file format if you ever decide you want to support that. Based on your specifications, I don't think you want to get into converting things into CSV and it seems like you have established input and output paths. For anyone who hasn't had the joy of dealing with CSV to Excel conversion, it can get a bit dicey. I have spent hours dealing with issues created by Excel converting string data to numeric data. You can see other testimonies to this effect on the POI Case Studies page. Beyond these issues, I simply don't want to personally have to handle these inputs. I'd rather invest the programming effort and streamline the workflow for the future.
I too have not dealt with ODF and have no plans to support it in my current project. You might want to check out the OpenOffice.org ODF Toolkit Project.
Good luck and have fun,
- D.
I suggest you to take a look at SimpleXlsBuilder and SimpleXlsSlurper, both are based on apache POI and can fit your basic needs for reading from and writing to Excel 97 spreadsheets in a concise way.
If your spreadsheets are simple enught - without charts and other embedded contents - you should simply convert the spreadsheet to CSV.
Pros:
Both xls and ods will produce the same CSV - You'll have to handle just one input type.
You won't have to mess with new versions of (Open) Office.
Handling plaintext is always more fun than other obscure formats.
Cons:
One that I can think of - finding a reliable converter from xls and odf to csv. Shouldn't be too hard - OpenOffice has a built in one.
A couple things:
1) I agree that using a CSV format can simplify some of the development work. OpenCSV can help with processing CSV files. There are other good CSV parsers for Java out there. Just remember that anything that's available for Java can be used by Groovy due to Groovy's unparalleled integration with Java.
2) I know you said you wanted to avoid handling XML, but Groovy makes XML processing exceedingly simple.