I had a requirement to extract specific columns/rows from an Excel/CSV file. Somebody suggested using Tika for this task.
While going through Tika, I came across the POI API and found it friendlier to use.
We may also need to parse PDF files in the future.
I am new to this technology; I would like to know the difference between the two and which one is more suitable for my requirement.
Thanks,
Krishna
Apache Tika provides a common way to extract consistent text and metadata from a wide range of formats. It also provides content detection, language detection and a few other bits. If you write your code to work with Apache Tika, then your code will be able to work with a huge range of formats in the same way. You don't need to worry about whether one format has a Title, or another calls the same logical thing a LongTitle or a Subject. You don't need to worry about what library to use for what format. You call Tika, it does the hard work for you, and back comes your consistent Metadata and Textual Content.
Apache POI is one of the libraries that Tika uses. POI supports most of the main Microsoft Office formats, including Excel (.xls and .xlsx). It provides access to the whole of the file format, allowing you complete control over what information you read out. (It also supports writing). Tika uses POI to get text and metadata out of the various different Microsoft formats, but doesn't extract everything. Using POI directly would allow you to decide what you care about and get that.
If you want to support lots of file formats, use Tika. If you want full control of how you get the information out, use POI.
Apache POI is a full-blown parser/writer for most Microsoft document formats. It supports both the newer 2007 format (XSSF) and the Microsoft 2003 file formats (HSSF). Apache POI provides two levels of API for parsing and generating Microsoft files: a higher-level API that is somewhat memory-intensive, since it reads the whole file and keeps it in memory, similar to DOM parsing in XML; and a lower-level API for memory-efficient use, similar to SAX/StAX parsing.
On the other hand, Apache Tika is a content analysis tool which, as far as I know, supports Microsoft Excel along with a lot of other formats through its extraction components. Tika has no support for writing new files or generating content, but that is not its use case anyway.
So, you have to choose depending on your need.
I recently came across Apache Tika, a beautiful toolkit which handles files of several types to extract the text (and some other information such as metadata).
The problem which I am facing is that given a document (in some format such as PDF, DOC, XLS and so on), I need to extract the text, modify some of it, and re-build the document in its original format (with the modified text). To my knowledge, Tika provides the facility of extraction of text, but does not 'stitch' modified documents back.
I feel that there are some libraries which do this for specific file types, but I am not aware of any toolkit similar to Tika, which provides an end-to-end solution for me by handling all the file types supported by Tika. I am also not sure if Tika itself can do this for me.
If someone knows anything of this sort, please let me know. I am looking for a library written in Java.
Regards,
Salil
EDIT: coderanch.com/how-to/java/AccessingFileFormats has several toolkits listed, but I would appreciate something that wraps all the formats supported by Tika comprehensively.
Apache POI
Apache POI is your Java Excel solution (for Excel 97-2008). We have a complete API for porting other OOXML and OLE2 formats and welcome others to participate.
OLE2 files include most Microsoft Office files such as XLS, DOC, and PPT as well as MFC serialization API based file formats. The project provides APIs for the OLE2 Filesystem (POIFS) and OLE2 Document Properties (HPSF).
Office OpenXML Format is the new standards based XML file format found in Microsoft Office 2007 and 2008. This includes XLSX, DOCX and PPTX.
Eclipse Birt
Q: What report output formats does BIRT support?
Release 2.1 supports HTML, Paginated HTML and PDF.
Release 2.2 supports HTML, Paginated HTML, PDF, WORD, XLS, and PostScript.
It appears that there are no better toolkits as mentioned here. The only way out would be to write your own wrapper for one or more of these toolkits to get the work done. It would have been great if Tika itself provided that facility, but that unfortunately does not seem to be the case.
Is it possible to convert MS Office file formats to PDF using Apache PDFBox (the documentation isn't clear about this, and the javadoc seems to indicate no such capability exists), or would I need to do some tedious conversions with Apache POI?
The reason I'm asking is the answer to this StackOverflow question:
https://stackoverflow.com/questions/10861227/convert-ms-office-to-pdf-in-java
I imagine I'll need to use Apache POI, but I wanted to clarify.
In order to do this conversion, you will need MS Office, or perhaps Google Drive. PDFBox does not convert from anything to PDF or vice versa -- it simply reads and writes PDF files. Apache POI will not do that type of conversion either -- it simply reads and writes MS Office files. Specifically, it does not render them. You could implement a rendering engine for each type of Office file yourself, but that would be a gargantuan task to say the least.
Take a look at https://angelozerr.wordpress.com/2012/12/06/how-to-convert-docxodt-to-pdfhtml-with-java/.
One of the possible options it mentions is XWPFConverterPDFViaIText:
org.apache.poi.xwpf.converter.pdf provides the DOCX 2 Pdf converter
based on Apache POI XWPF and iText.
You can test this converter with the REST Converter service
http://xdocreport-converter.opensagres.cloudbees.net/
I have to write a very large XLS file. I have tried Apache POI, but it simply takes up too much memory for me to use.
I had a quick look through StackOverflow and I noticed some references to the Cocoon project and, specifically the HSSFSerializer. It seems that this is a more memory-efficient way to write XLS files to disk (from what I've read, please correct me if I'm wrong!).
I'm interested in the use case described here: http://cocoon.apache.org/2.1/userdocs/xls-serializer.html . I've already written the code to write out the file in the Gnumeric format, but I can't seem to find how to invoke the HSSFSerializer to convert it to XLS.
On further reading it seems like the Cocoon project is a web framework of sorts. I may very well be barking up the wrong tree, but:
Could you provide an example of reading in a file, running the HSSFSerializer on it and writing that output to another file? It's not clear how to do so from the documentation.
My friend, the HSSF serializer is part of POI. You are just setting certain attributes in the XML to be serialized (but you need a whole process to create it). Also, setting up a whole pipeline using this framework just to create an XLS seems odd, as it changes the app's architecture. Is that your decision?
From the docs:
An alternate way of generating a spreadsheet is via the Cocoon
serializer (yet you'll still be using HSSF indirectly). With Cocoon
you can serialize any XML datasource (which might be a ESQL page
outputting in SQL for instance) by simply applying the stylesheet and
designating the serializer.
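If memory is an issue, try SXSSF in POI. It is the streaming extension of XSSF and keeps only a window of rows in memory, but note that it writes .xlsx rather than .xls files.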
I don't know if by "XLS" you mean a specific, prior to Office 2007, version of this "Horrible SpreadSheet Format" (which is what HSSF stands for), or just anything you can open with a recent version of MS Office, OpenOffice, ...
So depending on your client requirements (i.e. those that will open your Excel file), another option might be available : generating a .XLSX file.
It comes down to producing an XML file in the proper grammar, which seems to fit your situation, as you seem to have already done that with the Gnumeric XML-based file format without technical trouble and without hitting memory-efficiency issues.
Please note other XML-based spreadsheet formats exist, that Excel and other clients would be able to use. You might want to dig into the open document file formats.
As to whether to use Apache Cocoon or something else:
Cocoon can certainly host the XSL processing; batch (Cocoon CLI) processing is available if you require Cocoon but need it not to run as a webapp (though as far as I remember, the CLI feature was broken in the latest builds of the 2.1 series); and Cocoon comes with a load of features and technologies that could address further requirements.
Cocoon might be overkill if it just comes down to running an XSL transformation, for which there is a bunch of well-known, lighter tools you can pick from.
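As one example of such a lighter tool, the JDK itself ships an XSLT 1.0 processor behind javax.xml.transform. Below is a minimal sketch, assuming your data is already available as XML and you have a stylesheet for the target format. The class name XslRunner and the element names records/record/sheet/row/cell are made up for illustration; they are not Gnumeric or any real spreadsheet grammar.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical helper: runs an XSL transformation with the JDK's built-in processor. */
final class XslRunner {

    /** Applies the given stylesheet to the given XML and returns the result as a string. */
    static String transform(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Wrapped so callers do not have to deal with a checked exception.
            throw new IllegalStateException("XSL transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder input and stylesheet; replace with your own data and target grammar.
        String xml = "<records><record><name>apples</name><amount>3</amount></record></records>";
        String xslt =
              "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
            + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
            + "<xsl:template match='/records'>"
            + "<sheet><xsl:for-each select='record'>"
            + "<row><cell><xsl:value-of select='name'/></cell>"
            + "<cell><xsl:value-of select='amount'/></cell></row>"
            + "</xsl:for-each></sheet>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";
        System.out.println(transform(xml, xslt));
    }
}
```

For very large files you would pass file-based StreamSource and StreamResult objects instead of strings, so the output is written to disk rather than held in a StringWriter.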
I would like to parse some legal documents with a Java library into pieces of text that represent headers, paragraphs etc. Legal documents are usually well-structured, so I would like to use something a bit easier than JavaCC (or other parser generators). Are there any which would allow me to (nearly) automatically detect such a structure?
Thanks.
I think there is no tool that can "nearly automatically" extract such structures. If it is really easy to extract the structure, you would not need any tool; you can easily code it yourself. If it is not so easy, you need a tool that is powerful enough (JavaCC, ANTLR ...).
I think parsing the text yourself with custom code is the best way. Maybe read a bit about parsing beforehand (recursive descent, lexer/parser separation...). For simple structures it is not hard to get a working solution quickly.
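To give an idea of what such custom code can look like for a simple structure, here is a minimal sketch. It assumes a made-up convention: a header is a single line starting with "Article N" or "Section N", or with dotted numbering such as "1." or "1.2.", and paragraphs are separated by blank lines. The class DocumentSplitter and the header pattern are hypothetical and would need adjusting to your actual documents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical line-based splitter: classifies text into headers and paragraphs. */
final class DocumentSplitter {

    enum Kind { HEADER, PARAGRAPH }

    static final class Block {
        final Kind kind;
        final String text;

        Block(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }

        @Override
        public String toString() {
            return kind + ": " + text;
        }
    }

    // Assumed header convention: "Article 3 ...", "Section 4 ...", "1. ..." or "1.2. ...".
    private static final Pattern HEADER = Pattern.compile(
            "^(?:(?:Article|Section)\\s+\\d+\\b.*|\\d+(?:\\.\\d+)*\\.\\s+\\S.*)$");

    /** Splits the text into blocks; consecutive non-header lines are joined into one paragraph. */
    static List<Block> split(String text) {
        List<Block> blocks = new ArrayList<>();
        StringBuilder paragraph = new StringBuilder();
        for (String raw : text.split("\\R")) {
            String line = raw.trim();
            boolean header = HEADER.matcher(line).matches();
            if (line.isEmpty() || header) {
                flush(paragraph, blocks); // a blank line or a header ends the current paragraph
                if (header) {
                    blocks.add(new Block(Kind.HEADER, line));
                }
            } else {
                if (paragraph.length() > 0) {
                    paragraph.append(' ');
                }
                paragraph.append(line);
            }
        }
        flush(paragraph, blocks);
        return blocks;
    }

    private static void flush(StringBuilder paragraph, List<Block> blocks) {
        if (paragraph.length() > 0) {
            blocks.add(new Block(Kind.PARAGRAPH, paragraph.toString()));
            paragraph.setLength(0);
        }
    }

    public static void main(String[] args) {
        String sample = "Article 1 Definitions\nIn this agreement,\nterms have meanings.\n\n"
                + "1.1. Scope\nThis applies.\n";
        split(sample).forEach(System.out::println);
    }
}
```

A heuristic like this will misclassify a numbered list item such as "2. The tenant shall ..." as a header; once the rules need context to disambiguate, that is the point where a real parser generator starts to pay off.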
Apache POI - the Java API for Microsoft Documents
Apache PDFBox - Java PDF Library
An easier option is Apache Tika, a content analysis toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.
It uses PDFBox and POI internally.
Usage: java -jar tika-app-0.9.jar [option] [file] -t
This will parse the file(s) specified on the
command line and output the extracted text content.
Hi, I'm looking to parse spreadsheets (xls/ods) in Groovy. I have been using the Roo library for Ruby and was looking to try the same tasks in Groovy, as Java is already installed on a development server I use, and I would like to keep the number of technologies on the server to a simple core few.
I am aware that the ods format is zipped XML, and so can be parsed as such, but I would like to process the file using spreadsheet concepts, not XML concepts.
The ability to process xls files is not of major importance, but would save me having to save multiple xls files to ods (as this is for parsing data from clients).
Thanks
I would suggest Apache POI for access to .xls files.
I've never had to work with the .ods format, so no information on that one.
There's also JExcelAPI, which has a nice, clean, simple interface (for the most part).
Can't help you with ODS Files though.
How about looking at 'odftoolkit' ? http://odftoolkit.openoffice.org/
Groovy in Action has a chapter named "Groovy on Windows" that discusses using Scriptom, a Groovy/COM bridge (using JACOB under the covers), to access several Windows apps including Excel.
For OpenOffice, you can use ODF Toolkit, as Amit pointed out.
I second jdmichal's vote for Apache POI. I have selected it as our library of choice to handle Excel file input (.XLS). The project is also working on the .XLSX file format if you ever decide you want to support that. Based on your specifications, I don't think you want to get into converting things into CSV and it seems like you have established input and output paths. For anyone who hasn't had the joy of dealing with CSV to Excel conversion, it can get a bit dicey. I have spent hours dealing with issues created by Excel converting string data to numeric data. You can see other testimonies to this effect on the POI Case Studies page. Beyond these issues, I simply don't want to personally have to handle these inputs. I'd rather invest the programming effort and streamline the workflow for the future.
I too have not dealt with ODF and have no plans to support it in my current project. You might want to check out the OpenOffice.org ODF Toolkit Project.
Good luck and have fun,
- D.
I suggest you take a look at SimpleXlsBuilder and SimpleXlsSlurper; both are based on Apache POI and can fit your basic needs for reading from and writing to Excel 97 spreadsheets in a concise way.
If your spreadsheets are simple enough (without charts and other embedded content), you should simply convert the spreadsheet to CSV.
Pros:
Both xls and ods will produce the same CSV - You'll have to handle just one input type.
You won't have to mess with new versions of (Open) Office.
Handling plaintext is always more fun than other obscure formats.
Cons:
One that I can think of: finding a reliable converter from xls and ods to csv. Shouldn't be too hard, since OpenOffice has a built-in one.
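To illustrate how little code the CSV route needs once the conversion is done, here is a minimal sketch using only the JDK (it works unchanged from Groovy). It assumes comma-separated input with optional double-quoted fields and "" as the escaped quote, and it does not handle line breaks inside quoted fields. The class name CsvColumns is made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: picks selected columns out of CSV text, line by line. */
final class CsvColumns {

    /** Splits one CSV line into fields, honouring double-quoted fields and "" escapes. */
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote is a literal quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    /** Returns, for every line, only the requested zero-based columns ("" if a column is missing). */
    static List<List<String>> select(List<String> lines, int... columns) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : lines) {
            List<String> fields = parseLine(line);
            List<String> picked = new ArrayList<>();
            for (int col : columns) {
                picked.add(col < fields.size() ? fields.get(col) : "");
            }
            rows.add(picked);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("id,name,city", "1,\"Smith, John\",Paris");
        System.out.println(select(lines, 0, 1));
    }
}
```

For anything beyond simple files (embedded newlines, other separators, encodings), a dedicated CSV library is the safer choice.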
A couple things:
1) I agree that using a CSV format can simplify some of the development work. OpenCSV can help with processing CSV files. There are other good CSV parsers for Java out there. Just remember that anything that's available for Java can be used by Groovy due to Groovy's unparalleled integration with Java.
2) I know you said you wanted to avoid handling XML, but Groovy makes XML processing exceedingly simple.
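As a rough illustration of the XML route for .ods files, here is a minimal sketch in plain Java (callable from Groovy as-is). It relies on the facts mentioned in the question, that an .ods file is a zip archive, and reads the cell text from its content.xml entry. The class name OdsReader is made up, and the sketch makes simplifying assumptions: rows from all sheets are returned in one list, and repeated or merged cells are not expanded.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** Hypothetical helper: reads cell text from an OpenDocument spreadsheet. */
final class OdsReader {

    // Namespace of the table elements in OpenDocument content.xml.
    static final String TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0";

    /** Opens the .ods zip archive and parses its content.xml entry. */
    static List<List<String>> readFile(String odsPath) {
        try (ZipFile zip = new ZipFile(odsPath)) {
            ZipEntry entry = zip.getEntry("content.xml");
            if (entry == null) {
                throw new IllegalArgumentException("No content.xml in " + odsPath);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return readContent(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Parses content.xml and returns one list of cell texts per table row. */
    static List<List<String>> readContent(InputStream contentXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the namespace-based lookups below
            Document doc = factory.newDocumentBuilder().parse(contentXml);
            List<List<String>> rows = new ArrayList<>();
            NodeList rowNodes = doc.getElementsByTagNameNS(TABLE_NS, "table-row");
            for (int r = 0; r < rowNodes.getLength(); r++) {
                Element row = (Element) rowNodes.item(r);
                NodeList cellNodes = row.getElementsByTagNameNS(TABLE_NS, "table-cell");
                List<String> cells = new ArrayList<>();
                for (int c = 0; c < cellNodes.getLength(); c++) {
                    cells.add(cellNodes.item(c).getTextContent());
                }
                rows.add(cells);
            }
            return rows;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse content.xml", e);
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: java OdsReader.java <file.ods>");
            return;
        }
        readFile(args[0]).forEach(System.out::println);
    }
}
```

This stays at the level of XML concepts, so if you want real spreadsheet concepts (sheets, typed cells, formulas), the ODF Toolkit mentioned above is the better fit.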