Converting Microsoft Office documents and PDF documents to image files with Java


In my current project, I need to convert Microsoft Office documents and PDF documents to image files with Java. Is there an open source Java library for that? And if so, which is the most reliable?

You can try JODConverter.
It is an open source project. JODConverter, the Java OpenDocument Converter, converts documents between different office formats.
Picked from here

You can use Apache PDFBox for PDF documents and Apache POI for Microsoft Office documents.
Apache PDFBox
Apache POI

You can also use docx4j to convert the Office documents to PDF.

Related

Convert Excel to Word using Aspose.Word for Java

I have a requirement to convert an Excel template to Word. Then, using Aspose.Words for Java, I can merge all the Word templates (including the converted Excel template) into a PDF file.
Aspose, iText, POI, Jasper, BIRT, etc. don't support this. Is there any Java API that allows this kind of conversion?

You cannot convert Excel spreadsheets to Word documents directly via the Aspose.Cells APIs. Aspose.Cells is a spreadsheet management library that handles MS Excel file formats only; a separate component, Aspose.Words, manages and merges MS Word documents. For your specific requirement, I think you will need two Aspose APIs in two steps: first use Aspose.Cells to convert the spreadsheet (XLS/XLSX, etc.) to PDF, then use Aspose.Pdf to convert that PDF to a Word document.
I work as a Support Developer/Evangelist at Aspose.

You can try Apache POI, the Java API for Microsoft Documents.
Have a look here:
http://viralpatel.net/blogs/java-read-write-excel-file-apache-poi/

Document text extraction and modification

I recently came across Apache Tika, a beautiful toolkit which handles files of several types to extract the text (and some other information such as metadata).
The problem which I am facing is that given a document (in some format such as PDF, DOC, XLS and so on), I need to extract the text, modify some of it, and re-build the document in its original format (with the modified text). To my knowledge, Tika provides the facility of extraction of text, but does not 'stitch' modified documents back.
I feel that there are some libraries which do this for specific file types, but I am not aware of any toolkit similar to Tika, which provides an end-to-end solution for me by handling all the file types supported by Tika. I am also not sure if Tika itself can do this for me.
If someone knows anything of this sort, please let me know. I am looking for a library written in Java.
Regards,
Salil
EDIT: coderanch.com/how-to/java/AccessingFileFormats has several toolkits listed, but I would appreciate something that comprehensively wraps all the formats supported by Tika.

Apache POI
Apache POI is your Java Excel solution (for Excel 97-2008). We have a complete API for porting other OOXML and OLE2 formats and welcome others to participate.
OLE2 files include most Microsoft Office files such as XLS, DOC, and PPT as well as MFC serialization API based file formats. The project provides APIs for the OLE2 Filesystem (POIFS) and OLE2 Document Properties (HPSF).
Office OpenXML Format is the new standards based XML file format found in Microsoft Office 2007 and 2008. This includes XLSX, DOCX and PPTX.
Eclipse Birt
Q: What report output formats does BIRT support?
Release 2.1 supports HTML, Paginated HTML and PDF.
Release 2.2 supports HTML, Paginated HTML, PDF, WORD, XLS, and PostScript.

It appears that there are no better toolkits than those mentioned here. The only way out would be to write your own wrapper around one or more of these toolkits to get the work done. It would have been great if Tika itself provided that facility, but unfortunately that does not seem to be the case.

Java to Excel 2010

I have a piece of code that creates an Excel file using an open source library named OpenXLS-6.0.6.
My PC had Windows XP Professional and Office 2003.
However, I have noticed that since I migrated to Windows 7 and Office 2010, I am no longer able to open the generated Excel file.
I went to the OpenXLS website, and it does indeed state: "Compatible with Excel '97-2003 (.xls) file formats".
Is anybody aware of a library that can generate Excel files compatible with Office 2010?
I had a quick look at ExcelAPI and POI, but they are said to deal with Excel only up to the Office 2003 version (at least that was my understanding).

There is Apache POI. It is not the easiest library to use, but creating Excel files is the easiest thing that you can do with it. How to., How to 2. I have also seen commercial libraries that have free versions too.
XSSF is the POI Project's pure Java implementation of the Excel 2007 OOXML (.xlsx) file format.
The .xlsx format is also used by Office 2010.

An .xlsx file is a simple ZIP package containing a set of XML files, so you can unzip it and do what you want with the data.
Read more about Open XML and MS Office on the Microsoft website.
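
To illustrate the point about .xlsx being a ZIP package, here is a minimal sketch that lists the parts inside such a file using only the JDK's java.util.zip classes. The class name XlsxZipLister and the method listEntries are hypothetical names chosen for this example; nothing here interprets the XML parts, it only shows that they can be reached without any Office library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper for illustration; not part of POI or any other library.
class XlsxZipLister {

    /** Returns the names of all entries in a ZIP-based package such as an .xlsx file. */
    static List<String> listEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            // getNextEntry() returns null once the end of the archive is reached.
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java XlsxZipLister <file.xlsx>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            listEntries(in).forEach(System.out::println);
        }
    }
}
```

Run against a real workbook, this should print part names such as the workbook and worksheet XML files. Reading or writing cell values reliably still means understanding the Open XML schema, which is what a library such as POI's XSSF handles for you.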

How to load old Microsoft Office XML file (Excel) using Java

I'm not able to load an Excel file in the older Office XML format (think Office 2002 or 2003 version) into Java. I tried JXL and Apache's POI (version 3.7). POI doesn't work since it appears to want the newer Office .xlsx format.
Here's an example of the older Office XML format.
One can generate a similar XML file from MS Excel 2010 by saving the workbook in the format "XML Spreadsheet 2003".
Are there any open-source Java libraries that will load the XMLSS format? Otherwise I have no choice but to write a custom parser: read the XML file, then interpret the cell tags to build out the cell matrix. In this XML format, cells with empty values are skipped within a row, and the next cell with data is positioned by an index attribute that acts like a column offset, I assume to save space in the XML file.

The format is called SpreadsheetML (not to be confused with .xlsx, which is also XML-based). A library called Xelem can handle it:
import nl.fountain.xelem.excel.Workbook;
import nl.fountain.xelem.lex.ExcelReader;
//...
ExcelReader reader = new ExcelReader();
Workbook xlWorkbook = reader.getWorkbook("c:\\my\\spreadsheet.xml");
System.out.println(xlWorkbook.getSheetNames());

Copying Mark Beardsley's answer from POI team http://apache-poi.1045710.n5.nabble.com/How-to-convert-xml-to-xls-td2306602.html :
You have got an Office 2003 xml file there, not an OpenXML file; it is an early attempt by Microsoft to create an xml based file format for Excel and it is in that sense a 'valid' Office file format.
Sadly, POI cannot interpret this file at all and that is why you saw the exception when you tried to wrap it up in the InputStream and pass it to WorkbookFactory(s) constructor. You do however have a number of options;
You could use Excel itself and manually open and save each file you wish to convert, as you already have done.
If you have access to Visual Studio and can write Visual Basic or C# code then you could use a control that will allow you to control Excel programmatically. This way you could automate a file conversion process using Excel itself. Then once the file has been converted either to the binary or OpenXML formats, POI can be used to process it.
If you are running on a stand alone PC on which a copy of Excel is installed and using the Windows operating system, then you could use OLE to do something very similar from Java code. As above, POI can be used to process the file following the conversion.
If you have access to OpenOffice, it has a rather good API that is accessible from Java code. You could use it to convert between the file types for you - it is simply a matter of discovering the correct filter to use in this case. OpenOffice is good for all except the most complex files and you should be able to use POI to process the file following conversion. However, if you choose this route, it may be best to do all of the work using OpenOffice's UNO api.
Depending upon what you want to do with the file's contents, you could create your own parser using core java code and either the SAX or Xerces parsers (consider using xmlBeans (http://xmlbeans.apache.org/) ). If you simply open the original xml file using a simple text editor, you can see that the structure is not complex and, if all you wish to get at is the raw data it contains, this could be your best option.
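
The last option, a hand-written parser, could be sketched roughly as follows with the JDK's built-in SAX parser. This is an illustration under stated assumptions, not a complete reader: the class name SpreadsheetMlReader is hypothetical, every cell value is returned as a string, the rows of all worksheets are collected into one list, and only the ss:Index attribute on Cell elements is honoured (an ss:Index on a Row element is ignored). Skipped cells are filled with empty strings so that columns line up.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class for illustration: reads "XML Spreadsheet 2003" rows as strings.
class SpreadsheetMlReader extends DefaultHandler {
    // Namespace used by Workbook, Row, Cell, Data and the ss:Index attribute.
    private static final String SS = "urn:schemas-microsoft-com:office:spreadsheet";

    private final List<List<String>> rows = new ArrayList<>();
    private List<String> currentRow;  // non-null only while inside a <Row>
    private StringBuilder text;       // non-null only while inside a <Data>

    static List<List<String>> read(InputStream in) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        SpreadsheetMlReader handler = new SpreadsheetMlReader();
        factory.newSAXParser().parse(in, handler);
        return handler.rows;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (!SS.equals(uri)) {
            return;
        }
        if ("Row".equals(local)) {
            currentRow = new ArrayList<>();
        } else if ("Cell".equals(local) && currentRow != null) {
            // ss:Index is 1-based; pad the row so this cell lands in that column.
            String index = atts.getValue(SS, "Index");
            if (index != null) {
                int target = Integer.parseInt(index) - 1;
                while (currentRow.size() < target) {
                    currentRow.add("");
                }
            }
            currentRow.add(""); // placeholder, overwritten if a <Data> child follows
        } else if ("Data".equals(local) && currentRow != null && !currentRow.isEmpty()) {
            text = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (!SS.equals(uri)) {
            return;
        }
        if ("Data".equals(local) && text != null) {
            currentRow.set(currentRow.size() - 1, text.toString());
            text = null;
        } else if ("Row".equals(local) && currentRow != null) {
            rows.add(currentRow);
            currentRow = null;
        }
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            read(in).forEach(System.out::println);
        }
    }
}
```

A row holding a cell "a" followed by a cell with ss:Index="3" would come back as three values, with an empty string in the second column. Typed values (ss:Type), merged cells and styles are deliberately left out of this sketch.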

After a lot of pain I've found a solution to this. JODConverter uses the OpenOffice.org/LibreOffice API and can convert SpreadsheetML to whatever formats OpenOffice.org supports.

You might get some result using the OpenOffice API. If not directly you could probably convert to a 'supported' format.
Otherwise, the schema for the Office 2003 'SpreadsheetML' isn't very complicated. I have successfully created an XSLT scenario to convert a result set (database query) to a (simple yet effective) Excel 2003 document (XML format). The other way around should not be very hard to achieve.
Cheers,
Wim

The answer today was to ask the vendor to change their Excel file format to an Excel binary rather than the old Office XML. Doing so allowed me to use Apache POI 3.7 to read the file with no issues. I appreciate the answers, as I had no idea there was no direct support in the Java-based open source libraries for this old Office XML format. Now I know next time to check earlier to see what format the Excel files are in before committing to a timeline.

I had the same problem some time ago and ended up writing a SAX parser to read the XML file. I wrote a blog post about it here.
You can find the sample project to parse the file on GitHub.

How to index pdf, ppt, xl files in lucene (java based or python or php any of these is fine)?

Also, I want to know how to add metadata while indexing so that I can boost some parameters.

There are several frameworks for extracting text suitable for Lucene indexing from rich text files (pdf, ppt etc.)
One of them is Apache Tika, a sub-project of Lucene.
Apache POI is a more general document handling project inside Apache.
There are also some commercial alternatives.

You can use Apache Tika. Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.
Supported Document Formats
HyperText Markup Language
XML and derived formats
Microsoft Office document formats
OpenDocument Format
Portable Document Format
Electronic Publication Format
Rich Text Format
Compression and packaging formats
Text formats
Audio formats
Image formats
Video formats
Java class files and archives
The mbox format
The code will look like this.
Reader reader = new Tika().parse(stream);

Lucene indexes text not files - you'll need some other process for extracting the text out of the file and running Lucene over that.

See https://github.com/WolfgangFahl/pdfindexer
for a Java solution that uses PDFBox and Apache Lucene to split PDF files page by page into text,
index these text pages, and create an HTML index file that links to the pages in the PDF sources using a corresponding open parameter.
