How to convert docx to PDF without split tables - java

I have dynamics docx with few tables and I'm trying to convert to a PDF. When I converted to PDF then it covers two pages. I use Apache POI XWPF converter in 2.0.2 version.
In docx file everything is okey but when I convert to PDF then tables are spited
Someone have any idea or better library to convert docx to pdf?
PdfOptions options = PdfOptions.getDefault();
options.fontProvider((familyName, encoding, size, style, color) -> {
try {
BaseFont baseFont = BaseFont.createFont("fonts/times.ttf", encoding, BaseFont.EMBEDDED);
return new Font(baseFont, size, style, color);
} catch (Exception e) {
throw new RuntimeException(e);
}
});
PdfConverter.getInstance().convert(document, out, options);

There is no library to convert a doc[x] file into a completely correctly formatted PDF. The only program that can do that is Word itself.
I have achieved this by using the Word API in a PowerShell script:
$document_path = $args[0]
$document_parent_folder = $args[1]
$file_name = $args[2]
$word_app = New-Object -ComObject Word.Application
$document = $word_app.Documents.Open($document_path)
$pdf_filename = "$($document_parent_folder)\$($file_name)"
$document.SaveAs([ref] $pdf_filename, [ref] 17)
$document.Close()
$word_app.Quit()
Yes it is not the best solution and it is heavily dependent on having Microsoft Office installed in the machine and a lot of other problems that accompany this solution... But it is the only solution that formatted my documents exactly how I wanted them.
The script takes three arguments
The path of the document that will be converted
The folder where it is located
The name of the pdf file

Related

It is possible to use the TessAPI1.TessPDFRendererCreate API of tess4J without needing to create physical files?

I am using the Tesseract Java API (tess4J) to convert Tiff images to PDFs.
This works nicely, but I am forced to write both the source Tiff image and the output PDF to local filestore as actual physical files in order to use the TessAPI1.TessPDFRendererCreate API.
Please note the following in the code snippet below: -
The input Tiff is originally a java.awt.image.BufferedImage, but I have to write it to a physical file (sourceTiffFile is a File object).
I must specify a file path for the output (pdfFullFilepath is a String representing an absolute path for the new PDF file).
try {
ImageIO.write(bufferedImage, "tiff", sourceTiffFile);
} catch (Exception ioe) {
//handling code...
}
TessResultRenderer renderer = TessAPI1.TessPDFRendererCreate(pdfFullFilepath, dataPath, 0);
TessAPI1.TessResultRendererInsert(renderer, TessAPI1.TessPDFRendererCreate(pdfFullFilepath, dataPath, 0));
int result = TessAPI1.TessBaseAPIProcessPages(handle, sourceTiffFile.getAbsolutePath(), null, 0, renderer);
I would really like to avoid creating physical files, but am not sure if it is possible with this API. Ideally, I would like to pass the Tiff as a java.awt.image.BufferedImage or a byte array and receive the output PDF as a byte array.
Any suggestions would be most welcome as always. Thank you :)
You can pass in ProcessPage API method a Pix, which can be converted from a BufferedImage, but the output will still be a physical file. Tesseract API dictates that.
https://tesseract-ocr.github.io/tessapi/4.0.0/a01625.html
http://tess4j.sourceforge.net/docs/docs-4.4/net/sourceforge/tess4j/TessAPI1.html
For ex:
int result = TessAPI1.TessBaseAPIProcessPage(handle, LeptUtils.convertImageToPix(bufferedImage), page_index, "input file name", null, 0, renderer);

I can't import com.itextpdf.text.Document class

I'm building an android app and I want to use iText for creating pdf file, but I can't use Document class. As I seen in tutorials, there should be import com.itextpdf.text.Document for using Document class. For this app, I'm using com.itextpdf:itext-pdfa:5.5.9 library. I want to create a simple pdf file with 2 paragraphs, something like this:
try{
File pdfFolder = new File(Environment.getExternalStoragePublicDirectory(
Environment.DIRECTORY_DOCUMENTS), "pdfdemo");
if (!pdfFolder.exists()) {
pdfFolder.mkdir();
}
Date date = new Date() ;
String timeStamp = new SimpleDateFormat("yyyyMMdd_HHmmss").format(date);
File myFile = new File(pdfFolder + timeStamp + ".pdf");
OutputStream output = new FileOutputStream(myFile);
Document document = new Document();
PdfAWriter.getInstance(document, output);
document.open();
document.add(new Paragraph(mSubjectEditText.getText().toString()));
document.add(new Paragraph(mBodyEditText.getText().toString()));
document.close();
}catch (Exception e) {}
'
Could anyone help me with this problem? What am I doing wrong?
You say:
I'm using com.itextpdf:itext-pdfa:5.5.9 library
That is wrong for two reasons:
itext-pdfa is an addon to iText that is meant for writing or manipulating PDF/A documents. It requires the core iText libary. Read about the different parts of iText on the official web site: https://developers.itextpdf.com/itext-java
You say you want to use iText on Android, but you are referring to iText for Java. iText for Java contains classes that are not allowed on Android (java.awt.*, javax.nio,...). You should use the Android port for iText, which is called iTextG: https://developers.itextpdf.com/itextg-android
It's as if you're using iText without having visited the official iText web site. How is that even possible?
Just open your app level gradle file and add following code into your dependencies
implementation 'com.itextpdf:itext-pdfa:5.5.9'
It works for me

Viewing .doc file with java applet

I have a web application. I've generated MS Word document in xml format (Word 2003 XML Document) on server side. I need to show this document to a user on a client side using some kind of viewer. So, question is: what libraries I can use to solve this problem? I need an API to view word document on client side using java.
You cannot reliably display a Word document in a web page using Java (or any other simple technology for that matter). There are several commercial libraries out there to render Word, but you will not find these to be easy, cheap or reliable solutions.
What you should do is the following:
(1) Open the Word engine on the server using a .NET program
(2) Convert the document to Rich Text using the Word engine
(3) Display the rich text either using the RTF Swing widget, or convert to HTML:
String rtf = [your document rich text];
BufferedReader input = new BufferedReader(new StringReader(rtf));
RTFEditorKit rtfKit = new RTFEditorKit();
StyledDocument doc = (StyledDocument) rtfKit.createDefaultDocument();
rtfEdtrKt.read( input, doc, 0 );
input.close();
HTMLEditorKit htmlKit = new HTMLEditorKit();
StringWriter output = new StringWriter();
htmlKit.write( output, doc, 0, doc.getLength());
String html = output.toString();
The main risk in this approach is that the Word engine will either crash or have a memory leak. For this reason you have to have a mechanism for restarting it periodically and testing it to make sure it is functional and not hogging memory.
docx4all is a Swing-based applet which does Word 2007 XML (ie not Word 2003 XML), which we wrote several years ago.
Get it from svn.
That's a possible approach for editing. If all you want is a viewer, which not convert to HTML or PDF? You can use docx4j for that. (Disclosure: "my" project).
You might have a look at the Apache POI - Java API to Handle Microsoft Word Files which is able to read all kinds of word documents (OLE2 and OOXML formats, .doc and .docx extensions respectively).
Reading a doc file can be easy as:
import java.io.*;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
public class ReadDocFile {
public static void main(String[] args) {
File file = null;
WordExtractor extractor = null ;
try {
file = new File("c:\\New.doc");
FileInputStream fis=new FileInputStream(file.getAbsolutePath());
HWPFDocument document=new HWPFDocument(fis);
extractor = new WordExtractor(document);
String [] fileData = extractor.getParagraphText();
for(int i=0;i<fileData.length;i++){
if(fileData[i] != null)
System.out.println(fileData[i]);
}
}
catch(Exception exep){}
}
}
You can find more at: HWPF Quick-Guide (specifically HWPF unit tests)
Note that, according to the POI site:
HWPF is still in early development.
I'd suggest looking at the openoffice source code and implement that.
It's supposed to be written in java.

Changing word doc into pdf file

Is there a way to convert .doc file to .pdf keeping the format same as doc file which can also include images?
I am able to generate PDF file from doc but only the text appears.
You can use a library based on Open-Office.
It allows to convert from (and to) all the formats supported by OpenOffice.
Moreover, if your doc is read correctly by OpenOffice, it should be converted exactly as you see.
I know JOD Converter for exemple :
File inputFile = new File("document.doc");
File outputFile = new File("document.pdf");
// connect to an OpenOffice.org instance running on port 8100
OpenOfficeConnection connection = new SocketOpenOfficeConnection(8100);
connection.connect();
// convert
DocumentConverter converter = new OpenOfficeDocumentConverter(connection);
converter.convert(inputFile, outputFile);
// close the connection
connection.disconnect();
You can also use a simple command line (with oo installed) :
#!/bin/sh
DIR=$(pwd)
DOC=$DIR/$1
echo "Doc to convert : $DOC"
/user/bin/oowriter-invisible "macro://Standard.Module1.ConvertWordToPDF($DOC)"
You can use Apache POI to read the doc file and then Apache PDFBox to write the pdf file.
You can use Openoffice Macro for export doc as pdf like,
sub Docaspdf
rem ----------------------------------------------------------------------
rem define variables
dim document as object
dim dispatcher as object
rem ----------------------------------------------------------------------
rem get access to the document
document = ThisComponent.CurrentController.Frame
dispatcher = createUnoService("com.sun.star.frame.DispatchHelper")
rem ----------------------------------------------------------------------
dim args1(2) as new com.sun.star.beans.PropertyValue
args1(0).Name = "URL"
args1(0).Value = "file:///C:/doc.pdf"
args1(1).Name = "FilterName"
args1(1).Value = "writer_pdf_Export"
args1(2).Name = "FilterData"
args1(2).Value = Array(Array("UseLosslessCompression",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("Quality",0,90,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("ReduceImageResolution",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("MaxImageResolution",0,300,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("UseTaggedPDF",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("SelectPdfVersion",0,0,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("ExportNotes",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("ExportBookmarks",0,true,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("OpenBookmarkLevels",0,-1,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("UseTransitionEffects",0,true,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("IsSkipEmptyPages",0,true,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("IsAddStream",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("EmbedStandardFonts",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("FormsType",0,0,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("ExportFormFields",0,true,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("AllowDuplicateFieldNames",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("HideViewerToolbar",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("HideViewerMenubar",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("HideViewerWindowControls",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("ResizeWindowToInitialPage",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("CenterWindow",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("OpenInFullScreenMode",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("DisplayPDFDocumentTitle",0,true,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("InitialView",0,0,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("Magnification",0,0,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("Zoom",0,100,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("PageLayout",0,0,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("FirstPageOnLeft",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("InitialPage",0,1,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("Printing",0,2,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("Changes",0,4,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("EnableCopyingOfContent",0,true,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("EnableTextAccessForAccessibilityTools",0,true,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("ExportLinksRelativeFsys",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("PDFViewSelection",0,0,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("ConvertOOoTargetToPDFTarget",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("ExportBookmarksToPDFDestination",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("_OkButtonString",0,"",com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("EncryptFile",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("PreparedPasswords",0,,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("RestrictPermissions",0,false,com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("PreparedPermissionPassword",0,Array(),com.sun.star.beans.PropertyState.DIRECT_VALUE),Array("",0,,com.sun.star.beans.PropertyState.DIRECT_VALUE))
dispatcher.executeDispatch(document, ".uno:ExportToPDF", "", 0, args1())
end sub
import officetools.OfficeFile;
FileInputStream(new File("test.doc"));
FileOutputStream fos = new FileOutputStream(new File("test.pdf")); /
OfficeFile f = new OfficeFile(fis,"localhost","8100", false);
convert to pdf
f.convert(fos,"pdf");
You may use Aspose.Words for Java to convert Doc files to PDF. This component retains the format of the word document when converted to PDF. It also converts the images along with text.
Disclosure: I work as developer evangelist at Aspose.

Generating docx in java

I have a docx template that I am saving as .xml and then parsing the content.
Then I am generating a new updated word document. After the word document is generated I am unable to open it. It says " document corrupt ". I press ok. Then it says " Press OK if Do you want to retrieve the document ". I press ok. Then I get the updated document. This happens everytime. I have created the same program as stand alone java application. The document generated through the stand alone Java application opens without any errors. Could anyone provide me an insight into this ? I have used the same code for the server side also.
Here is the code that I use to generate the docuemnt.
try {
// Prepare the DOM document for writing
Source source = new DOMSource(doc);
// Prepare the output file
FileOutputStream file = new FileOutputStream(filename);
Result result = new StreamResult(file);
// Write the DOM document to the file
Transformer xformer = TransformerFactory.newInstance()
.newTransformer();
xformer.transform(source, result);
file.close();
} catch (TransformerConfigurationException e) {
System.out.println("Transformation Configuration Excepiton in WriteXMLFile");
} catch (TransformerException e) {
System.out.println("Transformation Excepiton in WriteXMLFile");
} catch (Exception e) {
System.out.println("Transformation Excepiton in WriteXMLFile");
e.printStackTrace();
}
I use POI library to generate Word documents (.doc, not .docx but it should work too).
With POI you can :
- open your word document
- edit whatever you want with a clean API (not mess up in XML)
- write the result
http://poi.apache.org/
You can use POI or docx4j to ensure you create valid Word documents.
Had you check the encoding of JVM?. I had have that problem, and finally I discovered that in Eclipse I had UTF-8, but in standalone I didn't specify encoding, so JVM take ISO-8559.
Please, check it with parameter -Dfile.encoding=UTF-8.
I have used both Apache POI and docx4j extensively, and having said that docx4j is more robust as it offers more support out of the box for not only the document itself but for tables and images. A lot of what docx4j does is automated, where areas Apache's POI you have to do a lot of manual coding for docx support. Unfortunately not much has been done for POI for docx support. I would suggest using docx4j as they have native support for opening and saving a new .docx file out of the box.

Categories