How to create a PDF/DOCX files in Java/Scala?

How to create a PDF/DOCX files in Java/Scala? - java

I am creating a web application which will accept some inputs from user (like name, age, address etc) and generate some predefined forms with filled information for user to download and print.
For example, an Application Form for driving license or something along those lines. The backend will have the format information about the document to be generated and other information will be gathered from user from front-end.
I am going to use Play Framework 2.5 for this and Java/Scala as programming language. But right now I am not aware if there are any free libraries/APIs that I can use to achieve this document generation.
I should be able to manipulate the font size, style, indentations, paragraphs, page borders, page numbers, alignments, document headers and footers, page size (A4, Legal etc) some other basic stuff. And I need documents in format that are widely supported for editing and printing purposes. Like PDF, DOCX for example. DOCX is preferred so user can edit something after downloading the document before taking a print out.

I have used the apache POI library to parse and create ms word documents (including docx) files:
http://www.tutorialspoint.com/apache_poi_word/apache_poi_word_quick_guide.htm
It's not amazing but it's the best I've found :)

I have used docx4j.jar which simply converts xhtml to docx.
What you can do for your requirement is save your format information as xhtml template and place input from form (like name,age,address etc) into the template at runtime.
This is a sample code to refer from this link
public static void main(String[] args) throws Exception
{
String xhtml=
"<table border=\"1\" cellpadding=\"1\" cellspacing=\"1\" style=\"width:100%;\"><tbody><tr><td>test</td><td>test</td></tr><tr><td>test</td><td>test</td></tr><tr><td>test</td><td>test</td></tr></tbody></table>";
// To docx, with content controls
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage();
XHTMLImporterImpl XHTMLImporter = new XHTMLImporterImpl(wordMLPackage);
wordMLPackage.getMainDocumentPart().getContent().addAll(
XHTMLImporter.convert( xhtml, null) );
wordMLPackage.save(new java.io.File("D://sample.docx"));
}

Related

Replacing text in XWPFParagraph without changing format of the docx file

I am developing font converter app which will convert Unicode font text to Krutidev/Shree Lipi (Marathi/Hindi) font text. In the original docx file there are formatted words (i.e. Color, Font, size of the text, Hyperlinks..etc. ).
I want to keep format of the final docx same as the original docx after converting words from Unicode to another font.
PFA.
Here is my Code
try {
fileInputStream = new FileInputStream("StartDoc.docx");
document = new XWPFDocument(fileInputStream);
XWPFWordExtractor extractor = new XWPFWordExtractor(document);
List<XWPFParagraph> paragraph = document.getParagraphs();
Converter data = new Converter() ;
for(XWPFParagraph p :document.getParagraphs())
{
for(XWPFRun r :p.getRuns())
{
String string2 = r.getText(0);
data.uniToShree(string2);
r.setText(string2,0);
}
}
//Write the Document in file system
FileOutputStream out = new FileOutputStream(new File("Output.docx");
document.write(out);
out.close();
System.out.println("Output.docx written successully");
}
catch (IOException e) {
System.out.println("We had an error while reading the Word Doc");
}

Thank you for ask-an-answer.
I have worked using POI some years ago, but over excel-workbooks, but still I’ll try to help you reach the root cause of your error.
The Java compiler is smart enough to suggest good debugging information in itself!
A good first step to disambiguate the error is to not overwrite the exception message provided to you via the compiler complain.
Try printing the results of e.getLocalizedMessage()or e.getMessage() and see what you get.
Getting the stack trace using printStackTrace method is also useful oftentimes to pinpoint where your error lies!
Share your findings from the above method calls to further help you help debug the issue.
[EDIT 1:]
So it seems, you are able to process the file just right with respect to the font conversion of the data, but you are not able to reconstruct the formatting of the original data in the converted data file.
(thus, "We had an error while reading the Word Doc", is a lie getting printed ;) )
Now, there are 2 elements to a Word document:
Content
Structure or Schema
You are able to convert the data as you are working only on the content of your respective doc files.
In order to be able to retain the formatting of the contents, your solution needs to be aware of the formatting of the doc files as well and take care of that.
MS Word which defined the doc files and their extension (.docx) follows a particular set of schemas that define the rules of formatting. These schemas are defined in Microsoft's XML Namespace packages[1].
You can obtain the XML(HTML) format of the doc-file you want quite easily (see steps in [1] or code in link [2]) and even apply different schemas or possibly your own schema definitions based on the definitions provided by MS's namespaces, either programmatically, for which you need to get versed with XML, XSL and XSLT concepts (w3schools[3] is a good starting point) but this method is no less complex than writing your own version of MS-Word; or using MS-Word's inbuilt tools as shown in [1].
[1]. https://www.microsoftpressstore.com/articles/article.aspx?p=2231769&seqNum=4#:~:text=During%20conversion%2C%20Word%20tags%20the,you%20can%20an%20HTML%20file.
[2]. https://svn.apache.org/repos/asf/poi/trunk/src/scratchpad/testcases/org/apache/poi/hwpf/converter/TestWordToHtmlConverter.java
[3]. https://www.w3schools.com/xml/
My answer provides you with a cursory overview of how to achieve what you want to, but depending on your inclination and time availability, you may want to use your discretion before you decide to head onto one path than the other.
Hope it helps!

How to add custom XML storage part to Word doc - preferrably with docx4j

I'm trying to populate a Word content control with XML data using docx4j (version 3.2.1). I'm evaluating this in order to use it for invoice generation. The documents we want to generate are not very complicated so this looks like a good approach to me.
I have created the content control through Word 2010 dev tools. This is how I try to inject the XML into the docx (taken from this example):
WordprocessingMLPackage wordMLPackage = Docx4J.load(new File(input_DOCX));
FileInputStream xmlStream = new FileInputStream(new File(input_XML));
Docx4J.bind(wordMLPackage, xmlStream, Docx4J.FLAG_BIND_INSERT_XML & Docx4J.FLAG_BIND_BIND_XML);
I get the following exception:
org.docx4j.openpackaging.exceptions.Docx4JException: Couldn't find CustomXmlDataStoragePart! exiting..
at org.docx4j.Docx4J.bind(Docx4J.java:300)
at org.docx4j.Docx4J.bind(Docx4J.java:271)
How can I add the CustomXmlDataStoragePart with docx4j, if it doesn't exist yet? Or should/can I do this in Word directly?
Note: I decided to prepare templates in Word directly, because later on these templates must be edited by non-technical users and I don't want to burden them with extra tools, if possible.

You say you "created the content control through Word 2010 dev tools". Unless you mean the content control toolkit, you need to use that or better, either of the OpenDoPE Word addins. Not both.
These tools add a custom xml part into the docx, and allow you to associate it with your content controls via XPath data bindings.
Then, when at runtime you invoke Docx4J.bind, docx4j finds that existing custom xml part, and replaces it with the xml file you provide which contains your runtime data.

IO Issue - Byte Array Image into XHTML(FlyingSaucer)

I have a solution that inserts strings into an XHTML document and prints the results as Reports. My employer has asked if we could pull images off their SQL database (stored as byte arrays) to insert into the Reports.
I am using FlyingSaucer as the XHTML interpreter and I've been using Java DOM to modify pre-stored reports that I have stored in the Report Generator's package.
The only solution I can think of at the moment is to construct the images, save them as a file, link the file in an img tag (or background-image) in a constructed report, print the report and then delete the file. This seems really sloppy and I imagine it will be very time consuming.
I can't help but feel there must be a more elegant solution. Any suggestions for inserting a byte array into html?

Read the image and convert it into it's Base64-encoded form:
InputStream image = getClass().getClassLoader().getResourceAsStream("image.png");
String encodedImage = BaseEncoding.base64().encode(ByteStreams.toByteArray(image));
I've used BaseEncoding and ByteStreams from Google Guava.
Change src attribute of img element within your Document object.
Document doc = ...; // get Document from XHTMLPanel.getDocument() or create
// new one using DocumentBuilderFactory
doc.getElementById("myImage").getAttributes().getNamedItem("src").setNodeValue("data:image/png;base64," + encodedImage);
Unfortunatley FlyingSaucer does not support DataURIs out-of-the-box so you'll have to create your own ReplacedElementFactory. Read Using Data URLs for embedding images in Flying Saucer generated PDFs article - it contains a complete solution.

Writing to a PDF from inside a GAE app

I need to read several megabytes (raw text strings) out of my GAE Datastore and then write them all to a new PDF file, and then make the PDF file available for the user to download.
I am well aware of the sandbox restrictions that prevent you from writing to the file system. I am wondering if there is a crafty way of creating a PDF in-memory (or a combo of memory and the blobstore) and then storing it somehow so that the client-side (browser) can actually pull it down as a file and save it locally.
This is probably a huge stretch, but my only other option is to farm this task out to a non-GAE server, which I would like to avoid at all cost, even if it takes a lot of extra development on my end. Thanks in advance.

You can definitely achieve your use case using GAE itself. Here are the steps that you should follow at a high level:
Download the excellent iText library, which is a Java library to work with PDFs. First build out your Java code to generate the PDF content. Check out various examples at : http://itextpdf.com/book/toc.php
Since you cannot write to a file directly, you need to generate your PDF content in bytes and then write a Servlet which will act as a Download Servlet. The Servlet will use the Response object to open a stream, manipulate the Mime Headers (filename, filetype) and write the PDF contents to the stream. A browser will automatically present a download option when you do that.
Your Download Servlet will have high level code that looks like this:
public class DownloadPDF extends HttpServlet {
public void doGet(HttpServletRequest req, HttpServletResponse res)
throws ServletException, IOException {
//Extract some request parameters, fetch your data and generate your document
String fileName = "<SomeFileName>.pdf";
res.setContentType("application/pdf");
res.setHeader("Content-Disposition", "attachment;filename=\"" + fileName + "\"");
writePDF(<SomeObjectData>, res.getOutputStream());
}
}
}
Remember the writePDF method above is your own method, where you use iText libraries Document and other classes to generate the data and write it ot the outputstream that you have passed in the second parameter.

While I'm not aware of the PDF generation on Google App Engine and especially in Java, but once you have it you can definitely store it and later serve it.
I suppose the generation of the PDF will take more than 30 seconds so you will have to consider using Task Queue Java API for this process.
After you have the file in memory you can simply write it to the Blobstore and later serve it as a regular blob. In the overview you will find a fully functional example on how to upload, write and serve your binary data (blobs) on Google App Engine.

I found a couple of solutions by googling. Please note that I have not actually tried these libraries, but hopefully they will be of help.
PDFJet (commercial)
Write a Google Drive document and export to PDF

How to automate PDF form-filling in Java

I am doing some "pro bono" development for a food pantry near where I live. They are inundated with forms and paperwork, and I would like to develop a system that simply reads data from their MySQL server (which I set up for them on a previous project) and feeds data into PDF versions of all the forms they are required to fill out. This will help them out enormously and save them a lot of time, as well as get rid of a lot of human errors that are made when filling out these forms.
Not knowing anything about the internals of PDF files, I can foresee two avenues here:
Harder Way: It is possible to scan a paper document, turn it into a PDF, and then have software that "fills out" the PDF simply by saying "add text except blah to the following (x,y) coordinates..."; or
Easier Way: PDF specification already allows for the construct of "fields" that can be filled out; this way I just write code that says "add text excerpt blah to the field called *address_value*...", etc.
So my first question is: which of the two avenues am I facing? Does PDF have a concept of "fields" or do I need to "fill out" these documents by telling the PDF library the pixel coordinates of where to place data?
Second, I obviously need an open source (and Java) library to do this. iText seems to be a good start but I've heard it can be difficult to work with. Can anyone lend some ideas or general recommendations here? Thanks in advance!

You can easily merge data into PDF's fields using the FDF(Form Data Format) technology.
Adobe provides a library to do that : Acrobat Forms Data Format (FDF) Toolkit
Also Apache PDFBox can be used to do that.

Please take a look at the chapter about interactive forms in the free ebook The Best iText Questions on StackOverflow. It bundles the answers to questions such as:
How to fill out a pdf file programatically?
How can I flatten a XFA PDF Form using iTextSharp?
Checking off pdf checkbox with itextsharp
How to continue field output on a second page?
finding out required fields to fill in pdf file
and so on...
Or you can watch this video where I explain how to use forms for reporting step by step.
See for instance:
public void manipulatePdf(String src, String dest) throws DocumentException, IOException {
PdfReader reader = new PdfReader(src);
PdfStamper stamper = new PdfStamper(reader,
new FileOutputStream(dest));
AcroFields fields = stamper.getAcroFields();
fields.setField("name", "CALIFORNIA");
fields.setField("abbr", "CA");
fields.setField("capital", "Sacramento");
fields.setField("city", "Los Angeles");
fields.setField("population", "36,961,664");
fields.setField("surface", "163,707");
fields.setField("timezone1", "PT (UTC-8)");
fields.setField("timezone2", "-");
fields.setField("dst", "YES");
stamper.setFormFlattening(true);
stamper.close();
reader.close();
}

public void fillPDF()
{
try {
PDDocument pDDocument = PDDocument.load(new File("D:/pdf/pdfform.pdf")); // pdfform.pdf is input file
PDAcroForm pDAcroForm = pDDocument.getDocumentCatalog().getAcroForm();
PDField field = pDAcroForm.getField("Given Name Text Box");
field.setValue("firstname");
field = pDAcroForm.getField("Family Name Text Box");
field.setValue("lastname");
field = pDAcroForm.getField("Country Combo Box");
field.setValue("Country");
System.out.println("country combo" );
field = pDAcroForm.getField(" Driving License Check Box");
field = pDAcroForm.getField("Favourite Colour List Box");
System.out.println("country combo"+ field.isRequired());
pDDocument.save("D:/pdf/pdf-java-output.pdf");
pDDocument.close();
} catch (IOException e) {
e.printStackTrace();
}
}

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How to create a PDF/DOCX files in Java/Scala? - java

I have used the apache POI library to parse and create ms word documents (including docx) files: http://www.tutorialspoint.com/apache_poi_word/apache_poi_word_quick_guide.htm It's not amazing but it's the best I've found :)

Related

Replacing text in XWPFParagraph without changing format of the docx file

How to add custom XML storage part to Word doc - preferrably with docx4j

IO Issue - Byte Array Image into XHTML(FlyingSaucer)

Writing to a PDF from inside a GAE app

How to automate PDF form-filling in Java

Categories

Resources