How to create a docx file using Java?

I am trying to create a .docx file using Java, but for some reason I can't open the file. The error is "Problem with the content of file". Does anyone know how to fix this problem?

Tried multiple libraries
This one is free.
http://www.docx4java.org/trac/docx4j
Also check Aspose (it is not free)
http://www.aspose.com/categories/java-components/aspose.total-for-java/default.aspx

You may use http://poi.apache.org/ to create such files.

Microsoft Word's docx files are zip files with specific content inside them. Simply creating a file in Java and writing some text to it isn't going to create a valid docx file that Word will recognize, even if you give it a .docx extension.
To create them from Java you can use the Apache POI XWPF library. That will give you some Java classes that'll create and write contents to docx files that will work with Word.
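To make the "zip file with specific content" point concrete, here is a minimal sketch that writes a one-paragraph docx using only java.util.zip. The class name MinimalDocxWriter and the output name hello.docx are made up for illustration, and the three parts shown are, as far as I know, the smallest set Word accepts. For anything beyond plain text a library is still the better route.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of any library.
final class MinimalDocxWriter {

    // Declares the content type of each part in the package.
    private static final String CONTENT_TYPES =
        "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>"
        + "<Types xmlns=\"http://schemas.openxmlformats.org/package/2006/content-types\">"
        + "<Default Extension=\"rels\" ContentType=\"application/vnd.openxmlformats-package.relationships+xml\"/>"
        + "<Default Extension=\"xml\" ContentType=\"application/xml\"/>"
        + "<Override PartName=\"/word/document.xml\" ContentType=\"application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml\"/>"
        + "</Types>";

    // Points the package at its main document part.
    private static final String ROOT_RELS =
        "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>"
        + "<Relationships xmlns=\"http://schemas.openxmlformats.org/package/2006/relationships\">"
        + "<Relationship Id=\"rId1\" Type=\"http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument\" Target=\"word/document.xml\"/>"
        + "</Relationships>";

    // Writes a docx containing a single paragraph; closes the given stream.
    static void write(OutputStream out, String text) throws IOException {
        String document =
            "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>"
            + "<w:document xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">"
            + "<w:body><w:p><w:r><w:t xml:space=\"preserve\">" + escapeXml(text) + "</w:t></w:r></w:p></w:body>"
            + "</w:document>";
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            addPart(zip, "[Content_Types].xml", CONTENT_TYPES);
            addPart(zip, "_rels/.rels", ROOT_RELS);
            addPart(zip, "word/document.xml", document);
        }
    }

    private static void addPart(ZipOutputStream zip, String name, String xml) throws IOException {
        zip.putNextEntry(new ZipEntry(name));
        zip.write(xml.getBytes(StandardCharsets.UTF_8));
        zip.closeEntry();
    }

    // Escapes the characters that would otherwise break the XML text node.
    private static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) throws IOException {
        try (OutputStream out = Files.newOutputStream(Paths.get("hello.docx"))) {
            write(out, "Hello from plain Java");
        }
    }
}
```

Writing the text straight into a file named hello.docx, without the zip container and these parts, is the kind of thing that produces the "Problem with the content" error.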

It sounds like you are producing a corrupt document. If it really is DOCX format, then open it with a ZIP tool and examine the contents of the XML files. There's a reasonable chance you are simply producing invalid XML, so looking at it with a browser or XML editor will help you there.
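The same check can be scripted. The sketch below is only a rough diagnostic under my own assumptions: the class name DocxXmlChecker is made up, it treats every entry ending in .xml or .rels as an XML part, and it only tests well-formedness, not whether the XML is valid WordprocessingML.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical diagnostic helper, not part of any library.
final class DocxXmlChecker {

    // Returns part name -> parser message for every XML part that is not well-formed.
    static Map<String, String> findBrokenParts(InputStream docx)
            throws IOException, ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        Map<String, String> broken = new LinkedHashMap<>();
        int entries = 0;
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                entries++;
                String name = entry.getName();
                if (entry.isDirectory() || !(name.endsWith(".xml") || name.endsWith(".rels"))) {
                    continue; // images and other binary parts are skipped
                }
                // Copy the part first: the parser would otherwise close the zip stream.
                byte[] bytes = readFully(zip);
                SAXParser parser = factory.newSAXParser();
                try {
                    parser.parse(new ByteArrayInputStream(bytes), new DefaultHandler());
                } catch (SAXException e) {
                    broken.put(name, e.getMessage());
                }
            }
        }
        if (entries == 0) {
            // A text file renamed to .docx ends up here.
            throw new IOException("No ZIP entries found: the file is not a docx package");
        }
        return broken;
    }

    private static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }
}
```

An empty result means every XML part parsed cleanly, which narrows the problem down to missing parts or schema-level mistakes.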
You probably need to say HOW you are producing the DOCX file so people can make better suggestions.
If you are looking for more options, I would look at docx4j and Docmosis. Please note I work for the company that created Docmosis.

Related

How do we get to know whether a macro is present or not in a word document?

Using Java, how can I find out whether a macro is present in a Microsoft Word document? I looked for a command-line switch for WinWord.exe, but there is no switch that reports this.
Use a library that can parse Word documents. Apache POI is a good choice as long as the documents aren't too big.
The library allows you to load the document. Afterwards, you can examine the various parts.
Bug 52949 has an attachment with sample code showing how to extract macro code. This should get you started.
If you're using the new XML format (.docx / OOXML), then the Word file is in fact a ZIP archive that you can unpack using the standard Java library. Inside, you will find a lot of XML files. The macros should be there as well.
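For the OOXML case, a minimal sketch of that idea could look like the following. It rests on an assumption not stated above: that the VBA project is stored as a binary part named vbaProject.bin (usually word/vbaProject.bin) rather than as XML. The class name MacroDetector is made up, and the check does not apply to legacy .doc files.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of any library.
final class MacroDetector {

    // Returns true if the OOXML package contains a part named vbaProject.bin.
    static boolean hasVbaProject(InputStream ooxml) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(ooxml)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // Assumption: macros live in a part called vbaProject.bin.
                if (entry.getName().toLowerCase(Locale.ROOT).endsWith("vbaproject.bin")) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

This only answers "is a macro container present"; reading the macro source itself needs a parser for the binary part, which is where the POI sample code comes in.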

Parse XLSX files in Java without external libraries?

Quick question: I've been asked to create a couple of parsers for XLSX file formats. Pretty much everywhere I've read says to grab the POI libraries; however, the system I am working on is very touchy about bringing in external APIs, so I'd far rather do some extra legwork myself than go down that route.
So is it possible (without spending days of coding) to parse an XLSX file via a SAXParser, or am I a mug if I don't use the POI libraries?
Cheers
* UPDATE *
Since extracting the XLSX file and having a better look at the archive, I believe I can now parse these files without spending days coding; I could probably extract the information within a few hours. I am, however, only looking to extract the physical cell data and not any reference data on those values, i.e. the cell reference. I am also looking to extract the XLSX metadata. I'll provide a quick answer on how I did this when I am done, for future reference.
Not without spending a few days of coding. You will have to write code for at least two or three days. It's just a zip file, but it holds a bunch of XML files and a manifest XML.
A standard xlsx file is not XML, so no, it's not possible.
Correction: Walter Laan is correct, the xlsx format is indeed a zip file full of XML files and should be relatively easy to parse.
Effectively I did this, but obviously tailored my Java to read the specific xlsx XML structure.
To open the xlsx in Java, use the ZipEntry APIs and enumerate the entries to ensure you drill down through all the various folder structures. Then follow the guide below to read the XML:
http://www.mkyong.com/java/how-to-read-xml-file-in-java-sax-parser/
Cheers
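The approach above (unzip, then SAX-parse the XML) might be sketched roughly as follows. This is not the poster's actual code. The class name XlsxCellReader is made up, and the part names xl/sharedStrings.xml and xl/worksheets/sheet1.xml are the usual defaults rather than something a robust reader should hard-code (the workbook relationships say where sheets really live). Inline strings and phonetic runs are ignored, and the whole archive is held in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: returns cell values of one sheet in document order.
final class XlsxCellReader {

    static List<String> readCellValues(InputStream xlsx, String sheetPart) throws Exception {
        Map<String, byte[]> parts = unzip(xlsx);

        // Pass 1: the shared string table; cells of type "s" refer to it by index.
        final List<String> shared = new ArrayList<>();
        byte[] sharedXml = parts.get("xl/sharedStrings.xml");
        if (sharedXml != null) {
            parse(sharedXml, new DefaultHandler() {
                private final StringBuilder item = new StringBuilder();
                private boolean inText;
                @Override public void startElement(String uri, String local, String qName, Attributes a) {
                    if (local.equals("si")) item.setLength(0);
                    if (local.equals("t")) inText = true;
                }
                @Override public void characters(char[] ch, int start, int length) {
                    if (inText) item.append(ch, start, length);
                }
                @Override public void endElement(String uri, String local, String qName) {
                    if (local.equals("t")) inText = false;
                    if (local.equals("si")) shared.add(item.toString());
                }
            });
        }

        // Pass 2: the sheet itself; each <c> cell carries its value in <v>.
        byte[] sheetXml = parts.get(sheetPart);
        if (sheetXml == null) throw new IOException("Part not found: " + sheetPart);
        final List<String> cells = new ArrayList<>();
        parse(sheetXml, new DefaultHandler() {
            private final StringBuilder value = new StringBuilder();
            private String type;
            private boolean inValue;
            @Override public void startElement(String uri, String local, String qName, Attributes a) {
                if (local.equals("c")) type = a.getValue("t");
                if (local.equals("v")) { inValue = true; value.setLength(0); }
            }
            @Override public void characters(char[] ch, int start, int length) {
                if (inValue) value.append(ch, start, length);
            }
            @Override public void endElement(String uri, String local, String qName) {
                if (!local.equals("v")) return;
                inValue = false;
                String raw = value.toString();
                cells.add("s".equals(type) ? shared.get(Integer.parseInt(raw.trim())) : raw);
            }
        });
        return cells;
    }

    private static void parse(byte[] xml, DefaultHandler handler) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(new ByteArrayInputStream(xml), handler);
    }

    private static Map<String, byte[]> unzip(InputStream in) throws IOException {
        Map<String, byte[]> parts = new HashMap<>();
        byte[] buf = new byte[8192];
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int n;
                while ((n = zip.read(buf)) != -1) out.write(buf, 0, n);
                parts.put(entry.getName(), out.toByteArray());
            }
        }
        return parts;
    }
}
```

A call such as readCellValues(stream, "xl/worksheets/sheet1.xml") yields the raw values only, which matches the stated goal of extracting cell data without cell references.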

How to load old Microsoft Office XML file (Excel) using Java

I'm not able to load an Excel file in the older Office XML format (think Office 2002 or 2003 version) into Java. I tried JXL and Apache's POI (version 3.7). POI doesn't work since it appears to want the newer Office .xlsx format.
Here's an example of the older Office XML format.
One can generate a similar XML file from MS Excel 2010 by saving the workbook in the format "XML Spreadsheet 2003".
Are there any open-source Java libraries that will load the XMLSS format? Otherwise I have no choice but to write a custom parser: read the XML file, then interpret the cell tags to build out the cell matrix. In this XML format, cells with empty values are skipped, and the next cell with data is positioned with an index attribute that acts like an offset in the columns, I assume to save space in the XML file.
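If a custom parser does turn out to be necessary, the index handling described above might be sketched like this with the JDK's SAX parser. The class name SpreadsheetMlReader is made up. The sketch assumes the usual urn:schemas-microsoft-com:office:spreadsheet namespace and 1-based ss:Index attributes on both Row and Cell, and it ignores merged cells, styles and data types.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: maps worksheet name -> rows -> cell texts.
final class SpreadsheetMlReader {

    private static final String SS = "urn:schemas-microsoft-com:office:spreadsheet";

    static Map<String, List<List<String>>> read(InputStream xml) throws Exception {
        final Map<String, List<List<String>>> sheets = new LinkedHashMap<>();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(xml, new DefaultHandler() {
            private List<List<String>> rows = new ArrayList<>();
            private List<String> row = new ArrayList<>();
            private final StringBuilder text = new StringBuilder();
            private boolean inData;

            @Override
            public void startElement(String uri, String local, String qName, Attributes a) {
                if (!SS.equals(uri)) return;
                String index = a.getValue(SS, "Index");
                switch (local) {
                    case "Worksheet":
                        rows = new ArrayList<>();
                        sheets.put(a.getValue(SS, "Name"), rows);
                        break;
                    case "Row":
                        // ss:Index jumps over skipped rows; fill the gap with empty rows.
                        if (index != null) {
                            while (rows.size() < Integer.parseInt(index) - 1) rows.add(new ArrayList<>());
                        }
                        row = new ArrayList<>();
                        rows.add(row);
                        break;
                    case "Cell":
                        // ss:Index jumps over skipped cells; fill the gap with empty strings.
                        if (index != null) {
                            while (row.size() < Integer.parseInt(index) - 1) row.add("");
                        }
                        text.setLength(0);
                        break;
                    case "Data":
                        inData = true;
                        break;
                    default:
                        break;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inData) text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (!SS.equals(uri)) return;
                if (local.equals("Data")) inData = false;
                if (local.equals("Cell")) row.add(text.toString());
            }
        });
        return sheets;
    }
}
```

Padding the gaps while parsing keeps column positions stable, so row.get(3) is always column D regardless of how many empty cells the file skipped.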
The format is called SpreadsheetML (do not confuse it with .xlsx, which is also XML-based). A library called Xelem can handle it:
import nl.fountain.xelem.excel.Workbook;
import nl.fountain.xelem.lex.ExcelReader;
//...
ExcelReader reader = new ExcelReader();
Workbook xlWorkbook = reader.getWorkbook("c:\\my\\spreadsheet.xml");
System.out.println(xlWorkbook.getSheetNames());
Copying Mark Beardsley's answer from POI team http://apache-poi.1045710.n5.nabble.com/How-to-convert-xml-to-xls-td2306602.html :
You have got an Office 2003 xml file there, not an OpenXML file; it is an early attempt by Microsoft to create an xml based file format for Excel and it is in that sense a 'valid' Office file format.
Sadly, POI cannot interpret this file at all and that is why you saw the exception when you tried to wrap it up in the InputStream and pass it to WorkbookFactory(s) constructor. You do however have a number of options;
You could use Excel itself and manually open and save each file you wish to convert, as you already have done.
If you have access to Visual Studio and can write Visual Basic or C# code then you could use a control that will allow you to control Excel programmatically. This way you could automate a file conversion process using Excel itself. Then once the file has been converted either to the binary or OpenXML formats, POI can be used to process it.
If you are running on a stand alone PC on which a copy of Excel is installed and using the Windows operating system, then you could use OLE to do something very similar from Java code. As above, POI can be used to process the file following the conversion.
If you have access to OpenOffice, it has a rather good API that is accessible from Java code. You could use it to convert between the file types for you - it is simply a matter of discovering the correct filter to use in this case. OpenOffice is good for all except the most complex files and you should be able to use POI to process the file following conversion. However, if you choose this route, it may be best to do all of the work using OpenOffice's UNO api.
Depending upon what you want to do with the file's contents, you could create your own parser using core java code and either the SAX or Xerces parsers (consider using xmlBeans (http://xmlbeans.apache.org/) ). If you simply open the original xml file using a simple text editor, you can see that the structure is not complex and, if all you wish to get at is the raw data it contains, this could be your best option.
After a lot of pain I've found a solution to this. JODConverter uses the OpenOffice.org/LibreOffice API and can convert SpreadsheetML to whatever formats OpenOffice.org supports.
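JODConverter's own API is not shown here. As a rough alternative built on the same idea, LibreOffice can be started headless from Java with ProcessBuilder. The sketch assumes a soffice executable on the PATH and that the installed version can import the SpreadsheetML file; the class name SofficeConverter and the paths in the usage line are made up.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper around the LibreOffice command line.
final class SofficeConverter {

    // Builds: soffice --headless --convert-to <format> --outdir <dir> <input>
    static List<String> buildCommand(Path input, String targetFormat, Path outDir) {
        return Arrays.asList(
            "soffice", "--headless",
            "--convert-to", targetFormat,
            "--outdir", outDir.toString(),
            input.toString());
    }

    // Runs the conversion and returns the process exit code (0 normally means success).
    static int convert(Path input, String targetFormat, Path outDir)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(input, targetFormat, outDir))
            .inheritIO() // show LibreOffice's own messages on the console
            .start();
        return process.waitFor();
    }
}
```

Something like SofficeConverter.convert(Paths.get("in/report.xml"), "xlsx", Paths.get("out")) would then leave a file that POI can read, provided the import filter accepts the input.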
You might get some result using the OpenOffice API. If not directly you could probably convert to a 'supported' format.
Otherwise the schema for the Office 2003 'SpreadsheetML' isn't very complicated. I have successfully created an XSLT scenario to convert a resultset (database query) to a (simple yet effective) Excel 2003 document (XML format). The other way around should not be very hard to achieve.
Cheers,
Wim
The answer today was to ask the vendor to change their Excel file format to an Excel binary rather than the old Office XML. Doing so allowed me to use Apache POI 3.7 to read the file with no issues. I appreciate the answers, as I had no idea there was no direct support in the Java-based open source libraries for this old Office XML format. Now I know next time to check earlier to see what format the Excel files are in before committing to a timeline.
I had the same problem some time ago, ended up writing a SAX parser to read the XML file. I wrote a blog post about it here.
You can find the sample project to parse the file in Github.

How to convert pdf to doc file in java

I need to convert a PDF file to a DOC file. I found various examples of generating a PDF file, but none for converting PDF to DOC.
What you're asking is actually very difficult.
I recommend you start here and look for a good parsing library. Then you would have to write it out in .doc format. Inevitably a lot of the formatting and extra information would be lost. It would be a lot easier to output to docx format, but I assume that's not what you're looking for.
I see a few possible solutions:
Davisor Publishor 6.2 can probably be used, but it is commercial and seems to generate only txt from pdf... just have a look
parse the pdf with iText, and then generate the doc with Apache POI - another way to try (free one ;)
look for command line tools, like Convert PDF To DOC, and execute them from Java
Otherwise take a look at Con's answer, there is a link to the list with java pdf processing libraries, maybe some library can do it directly, or can be used to parse pdf (better than iText), and then just use Apache POI to generate doc. Hope it helps ;)

How do I use Apache POI to read a .DOC file in Java to separate images from text?

I need to read a Word .doc file from Java that has text and images. I need to recognize the images & text and separate them into 2 files.
I've recently heard about "Apache POI." How I can use Apache POI to read Word .doc files?
The examples and sample code on apache's site are pretty good. I recommend you start there.
http://poi.apache.org/hwpf/quick-guide.html
To get specific bits of text, first create a org.apache.poi.hwpf.HWPFDocument. Fetch the range with getRange(), then get paragraphs from that. You can then get text and other properties.
Here for an example of extracting an image. Here for the latest revision as of this writing.
And of course, the Javadocs
Note that, according to the POI site,
HWPF is still in early development.
It's not free (or even cheap!) but Aspose.Words should be able to do this. Their evaluation download will let you play with small files.
Do the destination files also have to be Docs? You could open the docs in Office and save them out as HTML. Then the separation becomes trivial. RTF is also a viable option, but I can't recommend a good RTF parser off the top of my head.
Edit to say: I just remembered another possible solution: Jacob, but you'll need an instance of Office running on the same machine. It's short for Java COM Bridge and it lets you make calls to the COM libraries in Office to manipulate the documents. I'm sure it's not as scary as it might sound!
