Convert Word Document to PDF using Java

I want to convert an MS Word document to a PDF file, using POI.jar to read the Word content and Itext.jar to create the PDF file.
For plain text in the Word document, I am able to convert it to PDF. But the document also contains a few images, and I want those images to appear in the PDF as well.
Could someone please help me out?

You're in luck: I just stumbled upon JODConverter. It uses OpenOffice to do the conversion from Java, and it's very easy to use.

There isn't a free solution for this; you would have to buy something like the Aspose components. Alternatively, you can save the Word document as HTML and use any of the available HTML-to-PDF tools to convert it to PDF from Java. One of them is wkhtmltopdf.
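For the HTML route, a minimal sketch of calling wkhtmltopdf from Java might look like the following. It assumes the wkhtmltopdf binary is installed and on the PATH, and that the Word document has already been saved as HTML; the class name HtmlToPdf and the file names are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

class HtmlToPdf {

    // Builds the command line: wkhtmltopdf <input.html> <output.pdf>
    static List<String> buildCommand(Path html, Path pdf) {
        return Arrays.asList("wkhtmltopdf", html.toString(), pdf.toString());
    }

    // Runs wkhtmltopdf as an external process and returns its exit code.
    // Assumption: the binary is reachable through the PATH.
    static int convert(Path html, Path pdf) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(html, pdf))
                .inheritIO() // forward the tool's console output
                .start();
        return process.waitFor();
    }
}
```

Calling HtmlToPdf.convert(Paths.get("document.html"), Paths.get("document.pdf")) should return 0 on success. Images referenced by the HTML need to be resolvable from the HTML file's location.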

Related

Tess4j separate pages from pdf with multiple pages in Java Application

I have added Tess4j to my Java application and everything works great: my PDF document with more than 50 pages is properly recognized, read, and written to a text document as a string.
The problem is: how can I mark the end of each page of the PDF file in the text document, for example with a special string like ("++##- END--##++") that does not occur on any page of the PDF document?
Is this even possible with Tess4j?
Thank you
If you use the createDocuments method or the low-level TessResultRenderer API, the output text file will contain page-separator characters, which are form feeds (FF, '\f') by default.
https://tesseract-ocr.github.io/tessdoc/FAQ.html#what-page-separators-are-used-in-txt-output-by-tesseract-400
http://tess4j.sourceforge.net/docs
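Once the OCR text contains form-feed separators, turning them into a custom end-of-page marker is plain string processing. The sketch below does not call Tess4j at all; it assumes the text has already been read into a String, and the class and method names (PageMarker, splitPages, markPages) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PageMarker {

    // Splits OCR output into pages at each form feed ('\f').
    // A trailing separator after the last page does not produce an extra empty page.
    static List<String> splitPages(String ocrText) {
        List<String> pages = new ArrayList<>(Arrays.asList(ocrText.split("\f", -1)));
        if (!pages.isEmpty() && pages.get(pages.size() - 1).trim().isEmpty()) {
            pages.remove(pages.size() - 1);
        }
        return pages;
    }

    // Rebuilds the text with the given marker on its own line after every page.
    static String markPages(String ocrText, String marker) {
        StringBuilder out = new StringBuilder();
        for (String page : splitPages(ocrText)) {
            out.append(page).append('\n').append(marker).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "first page\fsecond page\f";
        System.out.print(markPages(sample, "++##- END--##++"));
    }
}
```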

PDFBox: convert PDF to text including chapter headline information

I am currently working on a project to extract the content of PDF files and search for certain keywords in them.
For extracting the content I am using PDFBox and this works fine.
The problem I now have encountered is that I want to be able to search for certain keywords only within chapter headlines.
At the moment my code for extracting looks like this:
PDDocument doc = PDDocument.load(pdfFile);
String text = new PDFTextStripper().getText(doc);
doc.close();
This only extracts the raw text of the file, with no information about headlines. I was not able to figure out how to make PDFBox include such information, so I am not sure whether this is even possible.
Does anybody have experience with this tool and can tell me whether it is possible to do this with PDFBox and, if so, how to achieve it?
Kind regards

Insert PDF file into MSWord using Java POI API

I am using the Apache POI API to create a Word document.
I want to insert a PDF document into MS Word (normally we use the Insert -> Object -> Create from file option in MS Word to do this).
Is it possible to insert the PDF as an object into MS Word using Java?
Regards,
Suthershan
To solve this you have to use OLE. After some quick research I found no example code for Word, but I did find some code for Excel and an example for PPTX. They may be helpful as a basis for writing the corresponding code for Word.

How to append/paste bufferedImage into a Word or RTF document using Java?

I created a Microsoft Word document and tried to write the buffered image to it but all I got was garbled text. Is there a way to write (preferably append) a buffered image to a doc or RTF file?
I want to avoid using docx4j or iText or any external package for that matter due to some constraints. But if there is no other way then please do let me know.
My code, in case anyone needs it for reference:
ps_file = new File("ps_file.doc");
ImageIO.write(i1, "jpg", ps_file);
Word documents have their own format for storing data, so you can't just append data to them and expect it to work.
You will have to use a third-party library unless you're willing to reinvent the wheel.
You can, however, create an RTF file that stores the image. A similar question has been answered here:
Programmatically adding Images to RTF Document
It's for C#, but the same procedure can be applied in Java.
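As a rough illustration of the RTF approach using only the JDK, the sketch below encodes a BufferedImage as PNG and embeds it as hex text in an RTF \pict group. It is an assumption-laden sketch, not a full RTF writer: the class name RtfImageWriter is made up, PNG is used instead of the question's JPEG, and the display size assumes 15 twips per pixel (96 dpi).

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

class RtfImageWriter {

    // Assumption: 96 dpi, so 1440 twips per inch / 96 = 15 twips per pixel.
    private static final int TWIPS_PER_PIXEL = 15;

    // Returns a complete minimal RTF document containing only the image.
    static String toRtf(BufferedImage image) {
        ByteArrayOutputStream png = new ByteArrayOutputStream();
        try {
            if (!ImageIO.write(image, "png", png)) {
                throw new IllegalStateException("No PNG writer available");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        int w = image.getWidth();
        int h = image.getHeight();
        StringBuilder rtf = new StringBuilder("{\\rtf1\\ansi\n{\\pict\\pngblip");
        rtf.append("\\picw").append(w).append("\\pich").append(h);
        rtf.append("\\picwgoal").append(w * TWIPS_PER_PIXEL);
        rtf.append("\\pichgoal").append(h * TWIPS_PER_PIXEL).append('\n');

        // RTF stores picture data as hexadecimal text, two digits per byte.
        for (byte b : png.toByteArray()) {
            rtf.append(String.format("%02x", b & 0xff));
        }
        return rtf.append("}\n}").toString();
    }
}
```

The returned string can be written to a file such as ps_file.rtf with Files.write(path, rtf.getBytes(StandardCharsets.US_ASCII)). To append to an existing RTF document, the {\pict ...} group would have to be inserted before the document's final closing brace rather than after it.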

Clipboard format for DOCX data

My Java application generates a document in DOCX format using DocX4J. I need to put it on the clipboard so it can be pasted into Word. I know that Word will consume HTML, but I'd rather not convert DOCX to HTML (I am not sure if DocX4J supports it, and I'd rather not lose any formatting). What clipboard format (in Java terms, DataFlavor) can I use to send DOCX data to the clipboard so that Word will understand it?
I am doing a similar thing with an OpenOffice document, and for that I use
DataFlavor odtFlavor = new DataFlavor("application/x-openoffice-embed-source-xml;"+
"representationclass=java.io.InputStream");
How should I represent the DOCX document itself? In case of OpenOffice ODT I pass the InputStream created from the ODT file.
I believe a similar question has been asked by David Thielen here: What are the clipboard formats for Microsoft Office where you can drop a chart? but there are no answers.
Worst case, docx4j can export to HTML, so you could do that.
Or you could use RTF. docx4j uses FOP to create PDF, so you could use the XSL FO output to create RTF (FOP can do that; your mileage may vary).
Not sure which of these will give you better quality. Possibly the HTML (though what happens to images?).
Or you could make a basic docx to RTF converter.
There may be a way to use the docx format.
If you copy from Word and look at the clipboard in ClipSpy (binary available in the source download from CodeProject), you'll see that "Embed Source" is the data as a docx in OLE.
The question "how to reload saved "Embed Source" clipboard data?" says you can write your own "Embed Source" by passing Clipboard.SetData a stream object.
It seems to depend on whether you want copy/paste or drag/drop, though.
See also your link, What are the clipboard formats for Microsoft Office where you can drop a chart?, and
http://social.msdn.microsoft.com/Forums/en/worddev/thread/84263fb9-61c2-424a-a294-a12f69fd6b1b
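On the Java side, the InputStream-based flavor from the question can be wrapped in a small Transferable, sketched below. The MIME type is the OpenOffice one quoted in the question; which flavor, if any, Word accepts for DOCX data is not established in this thread, so treat this only as the mechanics of offering a byte stream under a custom DataFlavor. The class name StreamTransferable is made up.

```java
import java.awt.datatransfer.DataFlavor;
import java.awt.datatransfer.Transferable;
import java.awt.datatransfer.UnsupportedFlavorException;
import java.io.ByteArrayInputStream;

class StreamTransferable implements Transferable {
    private final DataFlavor flavor;
    private final byte[] data;

    StreamTransferable(DataFlavor flavor, byte[] data) {
        this.flavor = flavor;
        this.data = data.clone();
    }

    // The flavor used in the question for ODT documents.
    static DataFlavor odtFlavor() {
        try {
            return new DataFlavor("application/x-openoffice-embed-source-xml;"
                    + "representationclass=java.io.InputStream");
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    @Override
    public DataFlavor[] getTransferDataFlavors() {
        return new DataFlavor[] { flavor };
    }

    @Override
    public boolean isDataFlavorSupported(DataFlavor candidate) {
        return flavor.equals(candidate);
    }

    // Hands out a fresh stream over the document bytes on every request.
    @Override
    public Object getTransferData(DataFlavor candidate) throws UnsupportedFlavorException {
        if (!isDataFlavorSupported(candidate)) {
            throw new UnsupportedFlavorException(candidate);
        }
        return new ByteArrayInputStream(data);
    }
}
```

An instance can be placed on the system clipboard with Toolkit.getDefaultToolkit().getSystemClipboard().setContents(transferable, null).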
