Document text extraction and modification


I recently came across Apache Tika, a beautiful toolkit which handles files of several types to extract the text (and some other information such as metadata).
The problem which I am facing is that given a document (in some format such as PDF, DOC, XLS and so on), I need to extract the text, modify some of it, and re-build the document in its original format (with the modified text). To my knowledge, Tika provides the facility of extraction of text, but does not 'stitch' modified documents back.
I feel that there are some libraries which do this for specific file types, but I am not aware of any toolkit similar to Tika, which provides an end-to-end solution for me by handling all the file types supported by Tika. I am also not sure if Tika itself can do this for me.
If someone knows anything of this sort, please let me know. I am looking for a library written in Java.
Regards,
Salil
EDIT: coderanch.com/how-to/java/AccessingFileFormats has several toolkits listed, but I would appreciate something that wraps all the formats supported by Tika comprehensively.

Apache POI
Apache POI is your Java Excel solution (for Excel 97-2008). We have a complete API for porting other OOXML and OLE2 formats and welcome others to participate.
OLE2 files include most Microsoft Office files such as XLS, DOC, and PPT as well as MFC serialization API based file formats. The project provides APIs for the OLE2 Filesystem (POIFS) and OLE2 Document Properties (HPSF).
Office OpenXML Format is the new standards based XML file format found in Microsoft Office 2007 and 2008. This includes XLSX, DOCX and PPTX.
Eclipse Birt
Q: What report output formats does BIRT support?
Release 2.1 supports HTML, Paginated HTML and PDF.
Release 2.2 supports HTML, Paginated HTML, PDF, WORD, XLS, and PostScript.

It appears that there are no toolkits better than those mentioned here. The only way forward would be to write your own wrapper around one or more of these toolkits to get the work done. It would have been great if Tika itself provided that facility, but that unfortunately does not seem to be the case.
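A minimal sketch of what such a wrapper could look like, assuming one handler per file format chosen by file extension. The `DocumentRewriter` interface and `RewriterRegistry` class are hypothetical names made up for illustration; they are not part of Tika or POI. Only a trivial plain-text handler is implemented here; handlers for DOC, XLS, PDF and so on would have to delegate to a format-specific library.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical per-format handler: read the bytes, edit the text, write the same format back. */
interface DocumentRewriter {
    byte[] rewrite(byte[] original, UnaryOperator<String> edit);
}

/** Hypothetical registry that picks a rewriter by file extension. */
final class RewriterRegistry {
    private final Map<String, DocumentRewriter> byExtension = new HashMap<>();

    RewriterRegistry register(String extension, DocumentRewriter rewriter) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), rewriter);
        return this;
    }

    byte[] rewrite(String fileName, byte[] original, UnaryOperator<String> edit) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        DocumentRewriter rewriter = byExtension.get(ext);
        if (rewriter == null) {
            throw new IllegalArgumentException("No rewriter registered for: " + fileName);
        }
        return rewriter.rewrite(original, edit);
    }

    /** Trivial handler for UTF-8 plain text; real formats need a format-specific library. */
    static DocumentRewriter plainText() {
        return (original, edit) ->
                edit.apply(new String(original, StandardCharsets.UTF_8))
                    .getBytes(StandardCharsets.UTF_8);
    }
}
```

Usage would be along the lines of `new RewriterRegistry().register("txt", RewriterRegistry.plainText()).rewrite("notes.txt", bytes, s -> s.replace("old", "new"))`. The hard part remains writing a faithful handler for each binary format, which is exactly the work no single toolkit seems to do for you.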

Related

Convert Excel to Word using Aspose.Word for Java

I need to convert an Excel template to Word, so that I can then use Aspose.Words for Java to merge all the Word templates (including the converted Excel template) into a PDF file.
Aspose, iText, POI, Jasper, BIRT, etc. don't support this. Is there any Java API that allows this kind of conversion?

You cannot convert Excel spreadsheets to Word documents directly via the Aspose.Cells APIs. Aspose.Cells is a spreadsheet management library that handles MS Excel file formats only; Aspose.Words is a separate component that manages and merges MS Word documents. For your specific requirement, I think you have to use two Aspose APIs in two steps: first Aspose.Cells to convert the spreadsheet (XLS, XLSX, etc.) to PDF, then Aspose.Pdf to convert that PDF to a Word document.
I work as a Support Developer/Evangelist at Aspose.

You can try Apache POI, the Java API for Microsoft Documents.
Have a look here:
http://viralpatel.net/blogs/java-read-write-excel-file-apache-poi/

Difference between Apache POI api and Apache Tika Api?

I had a requirement to extract specific columns/rows from an Excel/CSV file. Somebody suggested using Tika for this task.
While going through Tika, I came across the POI API and found it friendlier to use.
We may also need to parse PDF files in the future.
I am new to this technology; I would like to know the difference between the two and which one is more suitable for my requirement.
Thanks,
Krishna

Apache Tika provides a common way to extract consistent text and metadata from a wide range of formats. It also provides content detection, language detection and a few other bits. If you write your code to work with Apache Tika, then your code will be able to work with a huge range of formats in the same way. You don't need to worry about whether one format has a Title, or another calls the same logical thing a LongTitle or a Subject. You don't need to worry about what library to use for what format. You call Tika, it does the hard work for you, and back comes your consistent Metadata and Textual Content.
Apache POI is one of the libraries that Tika uses. POI supports most of the main Microsoft Office formats, including Excel (.xls and .xlsx). It provides access to the whole of the file format, allowing you complete control over what information you read out. (It also supports writing). Tika uses POI to get text and metadata out of the various different Microsoft formats, but doesn't extract everything. Using POI directly would allow you to decide what you care about and get that.
If you want to support lots of file formats, use Tika. If you want full control of how you get the information out, use POI.

Apache POI is a full-blown parser/writer for most Microsoft document formats. It supports both the newer 2007 format (XSSF) and the Microsoft 2003 file format (HSSF). Apache POI provides two levels of API for parsing and generating Microsoft files: a higher-level API that is somewhat memory-intensive because it reads the whole file and keeps it in memory, similar to DOM parsing in XML, and a lower-level API for memory-constrained use, which is similar to SAX/StAX parsing.
On the other hand, Apache Tika is a content analysis tool which, I guess, relies on POI for Microsoft formats such as Excel and on a lot of other extraction components for the rest. There is no support for writing new files or generating content with Tika; that is not its use case at all.
So, you have to choose depending on your need.

Is there a library capable of generating XSL-FO from Office XML documents like DOCX, XLSX?

Does anyone know of a library that is capable of generating XSL-FO from a Microsoft Office Open XML file, such as a Word DOCX or an Excel XLSX?
Given that these Office files are basically XML in a ZIP file, I figure it would be pretty straightforward to generate XSL-FO from them by applying appropriate XSLT transformations (though writing the XSLT would take some time). But if it is as straightforward as I suspect, then maybe someone has written a library that does it, or released XSLT transformations that do it.
This Microsoft MSDN library article contains an example of creating XSL-FO with Word 2003 WordprocessingML files, but I haven't seen anything for the newer Open XML format.
Does anyone have suggestions? A Java library would be preferable, but anything would be considered.
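To illustrate the "XML in a ZIP" idea, here is a rough sketch using only the JDK: it reads the `word/document.xml` part out of a DOCX and runs an XSLT stylesheet over it. The class name `DocxToFoSketch` and the toy stylesheet are made up for illustration; the stylesheet only emits one `fo:block` of plain text per paragraph and ignores styles, tables, images, numbering and everything else a real conversion would need.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class DocxToFoSketch {

    /** Toy stylesheet (an assumption, not a published one): one fo:block per w:p, text only. */
    static final String TOY_XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:fo='http://www.w3.org/1999/XSL/Format'"
      + " xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'"
      + " exclude-result-prefixes='w'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<fo:root><fo:layout-master-set>"
      + "<fo:simple-page-master master-name='page'><fo:region-body/></fo:simple-page-master>"
      + "</fo:layout-master-set>"
      + "<fo:page-sequence master-reference='page'>"
      + "<fo:flow flow-name='xsl-region-body'><xsl:apply-templates select='//w:p'/></fo:flow>"
      + "</fo:page-sequence></fo:root>"
      + "</xsl:template>"
      + "<xsl:template match='w:p'><fo:block><xsl:value-of select='.'/></fo:block></xsl:template>"
      + "</xsl:stylesheet>";

    /** Returns the raw bytes of one part (ZIP entry) of an OOXML package. */
    static byte[] readPart(byte[] zip, String partName) throws IOException {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zip))) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                if (entry.getName().equals(partName)) {
                    return in.readAllBytes(); // reads the current entry only
                }
            }
        }
        throw new IOException("Part not found: " + partName);
    }

    /** Applies the given XSLT to the main document part and returns the result as a string. */
    static String toFo(byte[] docx, String xslt) throws IOException, TransformerException {
        byte[] documentXml = readPart(docx, "word/document.xml");
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new ByteArrayInputStream(documentXml)),
                new StreamResult(out));
        return out.toString();
    }
}
```

Calling `DocxToFoSketch.toFo(docxBytes, DocxToFoSketch.TOY_XSLT)` shows that the plumbing is short; the effort is all in the stylesheet, since formatting in a DOCX is spread across several parts (styles, numbering, relationships) and not just `word/document.xml`.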

docx4j has support for this, for docx; since v3.3.0 it is in a separate project https://github.com/plutext/docx4j-export-FO
It uses XSLT to create the XSL-FO. The XSLT uses Java extension functions to invoke docx4j methods to do much of the work, keeping the XSLT itself relatively simple.
docx4j uses FOP to convert XSL FO to PDF.
docx4j has support for xlsx, but no built in export from XLSX to XSL FO.

RenderX has a set of publicly available stylesheets that convert WordML into XSL-FO
http://www.renderx.com/tools/word2fo.html
These stylesheets were prepared by RenderX's development team and Microsoft for general use. They are used to convert documents in Microsoft's WordprocessingML XML vocabulary into documents in the W3C's XSL FO (XSLFO) vocabulary. These generic stylesheets produce XSL FO (XSLFO) suitable for RenderX XEP Engine.

Generate PowerPoint 2007/2010 file using Java

Does anyone know of any API (commercial or open-source) that can generate/edit PowerPoint 2007/2010 presentations through Java? I have a template in the PowerPoint 2007/2010 format that I need to edit/update. So far I have been converting the .pptx file to XML, editing it and storing it back as .pptx, but the resulting file is reported as corrupted when I open it.
Is anyone aware of any other method or API that does this in Java?

We have done it programmatically (closed source at the moment, sorry) so might be able to help, but beware of a few gotchas.
One is that the POI project (at least when we looked at it last year) was quite incomplete. It didn't do PPTX charts, which is the one feature we wanted. In fact, the POI site may not be up to date, but they don't appear to support the PowerPoint 2007 format (http://poi.apache.org/slideshow/index.html). Everybody recommends this project, but our evaluation was that it was pretty much useless for generating PowerPoint 2007 files via Java. Your mileage may vary.
Aspose also had some significant limitations when we looked at it; not doing charts in PowerPoint 2007 was the blocking issue for us.
Another issue is that PowerPoint 2007 can be quite buggy. We have had a number of programmatically produced PPT files that caused lock-ups, but when testing we found that we could reproduce crashes and lock-ups with simple PPTX documents created in PowerPoint 2007, i.e. not our code.
In the end, we did the following: unpacked a 'template' PowerPoint file to a folder, then, on demand, filled the template XML with new values, zipped it up, renaming various elements, and delivered it to the user as a valid PPTX. Works OK, other than the odd PowerPoint crash when people edit the file. If there was a market for it, I guess we could package up the code as a web service (i.e. XML/CSV -> PPTX) or put together a commercial package, but we wouldn't do it for free.
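A rough sketch of the unpack/fill/re-zip idea using only the JDK, not our actual code. It assumes the template marks values with `${name}` placeholders, that the XML parts are UTF-8, and that each placeholder sits in a single text run (PowerPoint sometimes splits text across runs, which this does not handle). The class name `PptxTemplateFiller` is made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class PptxTemplateFiller {

    /** Copies every ZIP entry, replacing ${name} placeholders inside .xml parts only. */
    static byte[] fill(byte[] templateZip, Map<String, String> values) throws IOException {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(templateZip));
             ZipOutputStream out = new ZipOutputStream(result)) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                byte[] data = in.readAllBytes(); // reads the current entry only
                if (entry.getName().endsWith(".xml")) {
                    String xml = new String(data, StandardCharsets.UTF_8);
                    for (Map.Entry<String, String> value : values.entrySet()) {
                        xml = xml.replace("${" + value.getKey() + "}", escapeXml(value.getValue()));
                    }
                    data = xml.getBytes(StandardCharsets.UTF_8);
                }
                // Use a fresh ZipEntry so sizes and CRC are recomputed for the new content.
                out.putNextEntry(new ZipEntry(entry.getName()));
                out.write(data);
                out.closeEntry();
            }
        }
        return result.toByteArray();
    }

    /** Escapes the characters that would otherwise break the XML text node. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

Unescaped `&` or `<` in the inserted values is one easy way to end up with a file that PowerPoint reports as corrupted, which is why the values go through `escapeXml` before being substituted.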

docx4j (apache license) now includes a pptx4j component, which can open/edit/save pptx documents.

Yes. Check this out http://poi.apache.org/, they just released version 3.6 which now supports Office 2007 format documents. The best part is that it's free!

To generate a PowerPoint presentation from a template file, you can use PPT Templates.
This library provides a fluent API to replace variables inside the PPT template:
    try (FileOutputStream out = new FileOutputStream("generated.pptx")) {
        new PptMapper()
            .text("variable", "Hello")
            .text("other_variable", "World!")
            .processTemplate(PptTemplateDemo.class.getResourceAsStream("/title.pptx"))
            .write(out);
    }
With this library, you can process text and images in the template.

Another solution that may work for you is Windward Reports (disclaimer, I'm the founder & CEO there). It uses PPTX as one of the supported template formats and merges in data to then generate a PPTX (or PDF, etc.) output.
If the edit/update you need can be handled via the data tags in Windward, this should be trivial for you. If what you need cannot be handled by the tags, then this won't work for you.

Well, as mentioned by GrantB, the best way is to create a template, then load the template, traverse the XML graph, update the data and stream it out to an output PPT. We recently did this to generate reports for clients that had complex visuals and charts in PPT. You can have a look here: generate ppt in java

How to index pdf, ppt, xl files in lucene (java based or python or php any of these is fine)?

I also want to know how to add metadata while indexing so that I can boost some parameters.

There are several frameworks for extracting text suitable for Lucene indexing from rich text files (PDF, PPT, etc.).
One of them is Apache Tika, a sub-project of Lucene.
Apache POI is a more general document handling project inside Apache.
There are also some commercial alternatives.

You can use Apache Tika. Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.
Supported Document Formats
HyperText Markup Language
XML and derived formats
Microsoft Office document formats
OpenDocument Format
Portable Document Format
Electronic Publication Format
Rich Text Format
Compression and packaging formats
Text formats
Audio formats
Image formats
Video formats
Java class files and archives
The mbox format
The code will look like this.
Reader reader = new Tika().parse(stream);

Lucene indexes text not files - you'll need some other process for extracting the text out of the file and running Lucene over that.

See https://github.com/WolfgangFahl/pdfindexer
for a Java solution that uses PDFBox and Apache Lucene to split the PDF files page by page into text,
index these text pages and create a resulting HTML index file that links to the pages in the PDF sources by using a corresponding open parameter.
