Apache POI Replace String in Word

Apache POI Replace String in Word - java

I've recently been working on an automated system to make and print out letters to post. The system works as follows:
I create a file, with all the information in it, and replace some things with %... placeholders. For example, %name, %date, etc.
When I run the application, I can select a name from the list, and it automatically loads the document, replaces all the placeholders with information supplied by a MySQL database, and prints out the document. For testing purposes, I'm just saving the document for now.
I've found some tutorials on the internet, and found a code that suited my needs. Unfortunately, this code only works for Word versions older than 2007 (.doc files). What would I change for 2007+ compatibility (.docx files)?
public static void main(String[] args){
try{
FileInputStream fis = new FileInputStream("/Users/Jasper/Desktop/document.doc");
POIFSFileSystem fs = new POIFSFileSystem(fis);
HWPFDocument doc = new HWPFDocument(fs);
Range range = doc.getRange();
range.replaceText("%name", "Jasper");
range.replaceText("%age", "17");
FileOutputStream fos = new FileOutputStream("/Users/Jasper/Desktop/document2.doc");
doc.write(fos);
fis.close();
fos.close();
}catch(Exception e){
e.printStackTrace();
}
}

Please see my answer here I used this solution to automate generation of documents and it has been used since more than 1 year on production.

Related

Trying to sign a pdf document with java. Why is the signature invalid in the PDF files?

I'm currently working with PDFs on a Java application that makes some modifications to PDF Documents.
Currently, the signing of these PDFs is working, as I am using classes such as FileInputStream and FileOutputStream. Basically, I copy the original documents from a source folder, and then put them in a output folder, with. I am using PDDocument class with pdfbox 1.8.9
However, I want to use the same file, meaning I don't pretend to copy the PDFs anymore. I want to grab the document, sign it, and overwrite the original one.
Since I learned that having FileInputStream and FileOutputStream pointing at the same file is not a good idea, I simply tried to use the File class.
I tried the following:
File file = new File(locOriginal);
PDDocument doc = PDDocument.load(file);
PDSignature signature = new PDSignature();
Overlay overlay = new Overlay();
//The signature itself. It has not been modified
signature.setFilter(PDSignature.SUBFILTER_ADBE_PKCS7_DETACHED); // default filter
signature.setSubFilter(PDSignature.SUBFILTER_ADBE_PKCS7_DETACHED);
if (msg.getAreaNegocio().startsWith("A")) {
signature.setName(this.campoCertificadoAcquiring);
signature.setLocation(this.localCertificadoAcquiring);
signature.setReason(this.razaoCertificadoAcquiring);
}else {
signature.setName(this.campoCertificadoIssuing);
signature.setLocation(this.localCertificadoIssuing);
signature.setReason(this.razaoCertificadoIssuing);
}
// register signature dictionary and sign interface
doc.addSignature(signature,this);
doc.saveIncremental(file.getAbsolutePath());
doc.close();
My PDF file does get overwritten as intended, yet, the signature is not valid anymore when I open the file. I read these questions... Does it relate to any of these issues? What can I do to solve to this?
PDFBox 1.8.10: Fill and Sign PDF produces invalid signatures
PDFBox - opening and saving a signed pdf invalidates my signature
Thanks for the help!

The 1.8.* saveIncremental(filename) was buggy until PDFBox 1.8.16. This is described in PDFBOX-4312 but is confusing because the user deleted most of his own messages and had multiple other problems. If you insist on using an outdated version (that has a security issue), then try this code instead of calling saveIncremental(filename):
//BEWARE: do not "optimize" this method by using buffered streams,
// because COSStandardOutputStream only allows seeking
// if a FileOutputStream is passed, see PDFBOX-4312.
FileInputStream fis = new FileInputStream(fileName);
byte[] ba = IOUtils.toByteArray(fis);
fis.close();
FileOutputStream fos = new FileOutputStream(fileName);
fos.write(ba);
fis = new FileInputStream(fileName);
saveIncremental(fis, fos);
And no, I don't think that the questions you linked to related to your issue.
Btw I don't consider overwriting the original file to be a good idea. You are risking the loss of your file if there is an error or a power loss.
See also the comment by mkl: setFilter() is usually called with parameter PDSignature.FILTER_ADOBE_PPKLITE.

how to know whether a file is .docx or .doc format from Apache POI

I know we can get it done by extension or by mime type, do we have any other way through which we can get the idea of type of file whether it is .docx or .doc.

If it is just a matter of decided whether a collection of files known to either be .doc or .docx but are not marked accordingly with an extension, you can use the fact that a .docx file is a zipped collection of files. Something to the tune as follows might help:
boolean isZip = new ZipInputStream( fileStream ).getNextEntry() != null;
where fileStream is whatever file or other input stream you wish to evaluate. You could further evaluate a zipped file by looking for key .docx entries. A good starting reference is Word Document (DOCX). Likewise, if you know it is just a binary file, you can test for Word's File Information Block (see Word (.doc) Binary File Format)

You could use Apache Tika for content Detection. But you should been aware that this is a huge framework (many required dependencies) for such a small task.

There is a way, no strightforward though. But with Apache POI, you can locate it.
Try to read a .docx file using HWPFDocument Class. It would give you the following error
org.apache.poi.poifs.filesystem.OfficeXmlFileException: The supplied
data appears to be in the Office 2007+ XML. You are calling the part
of POI that deals with OLE2 Office Documents. You need to call a
different part of POI to process this data (eg XSSF instead of HSSF)
String filePath = "C:\\XXXX\XXXX.docx";
FileInputStream inStream;
try {
inStream = new FileInputStream(new File(filePath));
HWPFDocument doc = new HWPFDocument(inStream);
WordExtractor wordExtractor = new WordExtractor(doc);
System.out.println("Getting words"+wordExtractor.getText());
} catch (Exception e) {
System.out.print("Its not a .doc format");
}
.docx can be read using XWPFDocument Class.

Why dont you use Apache Tika:
File file = new File('File Here');
Tika tika = new Tika();
String filetype = tika.detect(file);
System.out.println(filetype);

Assuming you're using Apache POI, you have a few options.
One is to grab the first few bytes of the file, and ask POIFSFileSystem with the hasPOIFSHeader(byte) method. If you have a stream that supports mark/reset, you can instead use POIFSFileSystem.hasPOIFSHeader(InputStream). If those return true then try to open it as a .doc with HWPF, otherwise try as .docx with XWPF
Otherwise, if you prefer a try/catch way, try to open it with POIFSFileSystem and catch OfficeXmlFileException - if it opens fine it's .doc, if you get the exception it's .docx
If you look at the source code for WorkbookFactory you'll see the first pattern in use, you can copy a similar set of logic form that

I can't import com.itextpdf.text.Document class

I'm building an android app and I want to use iText for creating pdf file, but I can't use Document class. As I seen in tutorials, there should be import com.itextpdf.text.Document for using Document class. For this app, I'm using com.itextpdf:itext-pdfa:5.5.9 library. I want to create a simple pdf file with 2 paragraphs, something like this:
try{
File pdfFolder = new File(Environment.getExternalStoragePublicDirectory(
Environment.DIRECTORY_DOCUMENTS), "pdfdemo");
if (!pdfFolder.exists()) {
pdfFolder.mkdir();
}
Date date = new Date() ;
String timeStamp = new SimpleDateFormat("yyyyMMdd_HHmmss").format(date);
File myFile = new File(pdfFolder + timeStamp + ".pdf");
OutputStream output = new FileOutputStream(myFile);
Document document = new Document();
PdfAWriter.getInstance(document, output);
document.open();
document.add(new Paragraph(mSubjectEditText.getText().toString()));
document.add(new Paragraph(mBodyEditText.getText().toString()));
document.close();
}catch (Exception e) {}
'
Could anyone help me with this problem? What am I doing wrong?

You say:
I'm using com.itextpdf:itext-pdfa:5.5.9 library
That is wrong for two reasons:
itext-pdfa is an addon to iText that is meant for writing or manipulating PDF/A documents. It requires the core iText libary. Read about the different parts of iText on the official web site: https://developers.itextpdf.com/itext-java
You say you want to use iText on Android, but you are referring to iText for Java. iText for Java contains classes that are not allowed on Android (java.awt.*, javax.nio,...). You should use the Android port for iText, which is called iTextG: https://developers.itextpdf.com/itextg-android
It's as if you're using iText without having visited the official iText web site. How is that even possible?

Just open your app level gradle file and add following code into your dependencies
implementation 'com.itextpdf:itext-pdfa:5.5.9'
It works for me

Viewing .doc file with java applet

I have a web application. I've generated MS Word document in xml format (Word 2003 XML Document) on server side. I need to show this document to a user on a client side using some kind of viewer. So, question is: what libraries I can use to solve this problem? I need an API to view word document on client side using java.

You cannot reliably display a Word document in a web page using Java (or any other simple technology for that matter). There are several commercial libraries out there to render Word, but you will not find these to be easy, cheap or reliable solutions.
What you should do is the following:
(1) Open the Word engine on the server using a .NET program
(2) Convert the document to Rich Text using the Word engine
(3) Display the rich text either using the RTF Swing widget, or convert to HTML:
String rtf = [your document rich text];
BufferedReader input = new BufferedReader(new StringReader(rtf));
RTFEditorKit rtfKit = new RTFEditorKit();
StyledDocument doc = (StyledDocument) rtfKit.createDefaultDocument();
rtfEdtrKt.read( input, doc, 0 );
input.close();
HTMLEditorKit htmlKit = new HTMLEditorKit();
StringWriter output = new StringWriter();
htmlKit.write( output, doc, 0, doc.getLength());
String html = output.toString();
The main risk in this approach is that the Word engine will either crash or have a memory leak. For this reason you have to have a mechanism for restarting it periodically and testing it to make sure it is functional and not hogging memory.

docx4all is a Swing-based applet which does Word 2007 XML (ie not Word 2003 XML), which we wrote several years ago.
Get it from svn.
That's a possible approach for editing. If all you want is a viewer, which not convert to HTML or PDF? You can use docx4j for that. (Disclosure: "my" project).

You might have a look at the Apache POI - Java API to Handle Microsoft Word Files which is able to read all kinds of word documents (OLE2 and OOXML formats, .doc and .docx extensions respectively).
Reading a doc file can be easy as:
import java.io.*;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
public class ReadDocFile {
public static void main(String[] args) {
File file = null;
WordExtractor extractor = null ;
try {
file = new File("c:\\New.doc");
FileInputStream fis=new FileInputStream(file.getAbsolutePath());
HWPFDocument document=new HWPFDocument(fis);
extractor = new WordExtractor(document);
String [] fileData = extractor.getParagraphText();
for(int i=0;i<fileData.length;i++){
if(fileData[i] != null)
System.out.println(fileData[i]);
}
}
catch(Exception exep){}
}
}
You can find more at: HWPF Quick-Guide (specifically HWPF unit tests)
Note that, according to the POI site:
HWPF is still in early development.

I'd suggest looking at the openoffice source code and implement that.
It's supposed to be written in java.

Generating docx in java

I have a docx template that I am saving as .xml and then parsing the content.
Then I am generating a new updated word document. After the word document is generated I am unable to open it. It says " document corrupt ". I press ok. Then it says " Press OK if Do you want to retrieve the document ". I press ok. Then I get the updated document. This happens everytime. I have created the same program as stand alone java application. The document generated through the stand alone Java application opens without any errors. Could anyone provide me an insight into this ? I have used the same code for the server side also.
Here is the code that I use to generate the docuemnt.
try {
// Prepare the DOM document for writing
Source source = new DOMSource(doc);
// Prepare the output file
FileOutputStream file = new FileOutputStream(filename);
Result result = new StreamResult(file);
// Write the DOM document to the file
Transformer xformer = TransformerFactory.newInstance()
.newTransformer();
xformer.transform(source, result);
file.close();
} catch (TransformerConfigurationException e) {
System.out.println("Transformation Configuration Excepiton in WriteXMLFile");
} catch (TransformerException e) {
System.out.println("Transformation Excepiton in WriteXMLFile");
} catch (Exception e) {
System.out.println("Transformation Excepiton in WriteXMLFile");
e.printStackTrace();
}

I use POI library to generate Word documents (.doc, not .docx but it should work too).
With POI you can :
- open your word document
- edit whatever you want with a clean API (not mess up in XML)
- write the result
http://poi.apache.org/

You can use POI or docx4j to ensure you create valid Word documents.

Had you check the encoding of JVM?. I had have that problem, and finally I discovered that in Eclipse I had UTF-8, but in standalone I didn't specify encoding, so JVM take ISO-8559.
Please, check it with parameter -Dfile.encoding=UTF-8.

I have used both Apache POI and docx4j extensively, and having said that docx4j is more robust as it offers more support out of the box for not only the document itself but for tables and images. A lot of what docx4j does is automated, where areas Apache's POI you have to do a lot of manual coding for docx support. Unfortunately not much has been done for POI for docx support. I would suggest using docx4j as they have native support for opening and saving a new .docx file out of the box.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Apache POI Replace String in Word - java

Please see my answer here I used this solution to automate generation of documents and it has been used since more than 1 year on production.

Related

Trying to sign a pdf document with java. Why is the signature invalid in the PDF files?

how to know whether a file is .docx or .doc format from Apache POI

I can't import com.itextpdf.text.Document class

Viewing .doc file with java applet

Generating docx in java

Categories

Resources