Apache POI - Docx output issue - java

I am evaluating Apache POI as an option for writing docx files. Specifically, I want to generate content in the docx file in different languages (Hindi/Marathi to be specific). I am facing the following issue:
When the docx file is written, the Hindi/Marathi text shows up as square boxes even though the font "Arial Unicode MS" supports it. When I inspect the boxes, MS Word reports the font as "Calibri", even though I have explicitly set the font to "Arial Unicode MS". If I select the boxes in MS Word and then change the font to "Arial Unicode MS", the Hindi/Marathi words are displayed correctly. Any idea why this happens? Please note I am using a development version of POI, as the previous stable version did not support setting font families. Here is the source:
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
public class CreateDocumentFromScratch
{
    public static void main(String[] args)
    {
        XWPFDocument document = new XWPFDocument();
        // Paragraph with Devanagari text
        XWPFParagraph paragraphTwo = document.createParagraph();
        XWPFRun paragraphTwoRunOne = paragraphTwo.createRun();
        paragraphTwoRunOne.setFontFamily("Arial Unicode MS");
        paragraphTwoRunOne.setText("नसल्यास");
        // Paragraph with Latin text
        XWPFParagraph paragraphThree = document.createParagraph();
        XWPFRun paragraphThreeRunOne = paragraphThree.createRun();
        paragraphThreeRunOne.setFontFamily("Arial Unicode MS");
        paragraphThreeRunOne.setText("This is nice");
        // try-with-resources closes the stream even if write() fails,
        // and avoids calling write() on a null stream when the file cannot be opened
        try (FileOutputStream outStream = new FileOutputStream("c:/will/First.doc")) {
            document.write(outStream);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Any help will be appreciated.

To resurrect a very old post: can the OP confirm the version of MS Office that is being used? The problem appears to be with MS Office 2003 running on Windows XP, but it could occur on a later OS version, too.
It would appear that MS Word applies the Mangal font for Hindi script [Encoding standard: Indic: Hindi ISCII 57002 (Devanagari)]. The following link explains this:
https://support.office.com/en-ca/article/Choose-text-encoding-when-you-open-and-save-files-60d59c21-88b5-4006-831c-d536d42fd861
Suggested workaround:
From the Windows XP Control Panel, select Regional and Language Options. Select Languages. Check the box "Install files for complex script and right-to-left languages (including Thai)".
Restart PC.
However, no such problem was observed when opening the file using LibreOffice version 4.3.5.2 on Windows, and LibreOffice 4.2.7.2 on Linux (Ubuntu).
Used the following libraries:
poi-3.10-FINAL-20140208.jar, poi-ooxml-3.10-FINAL-20140208.jar,
poi-ooxml-schemas-3.10-FINAL-20140208.jar, xmlbeans-2.3.0.jar,
dom4j-1.6.1.jar, stax-api-1.0.1.jar
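Independently of the Word/OS version, it can help to look at what POI actually wrote into the file. A .docx file is a ZIP archive whose main part is word/document.xml, and each run's font is stored in a w:rFonts element with separate attributes per character range (w:ascii, w:hAnsi, w:eastAsia, w:cs). Below is a minimal sketch, using only the JDK, that prints those elements. The class name and default path are hypothetical. It assumes the usual "w:" namespace prefix and UTF-8 XML. Whether a missing w:cs attribute is what makes Word fall back to another font for the Devanagari text is an assumption that this thread does not confirm.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

class DocxFontInspector {
    // Matches every <w:rFonts ...> element (assumes the usual "w:" prefix).
    private static final Pattern R_FONTS = Pattern.compile("<w:rFonts\\b[^>]*>");

    /** Returns the raw w:rFonts elements found in word/document.xml. */
    static List<String> findRunFonts(String docxPath) throws IOException {
        List<String> result = new ArrayList<>();
        try (ZipFile zip = new ZipFile(docxPath)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("word/document.xml is missing; not a .docx file?");
            }
            try (InputStream in = zip.getInputStream(entry)) {
                // Assumption: the XML part is UTF-8 encoded.
                String xml = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                Matcher m = R_FONTS.matcher(xml);
                while (m.find()) {
                    result.add(m.group());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default path; pass the real file as the first argument.
        String path = args.length > 0 ? args[0] : "First.docx";
        for (String element : findRunFonts(path)) {
            System.out.println(element);
        }
    }
}
```

If the printed elements carry only w:ascii, comparing them with the same file after fixing the font by hand in Word and re-saving should show which attributes Word adds.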

Related

How to convert docx to PDF without split tables

I have a dynamic docx with a few tables and I'm trying to convert it to PDF. The converted PDF spans two pages. I use the Apache POI XWPF converter, version 2.0.2.
In the docx file everything is okay, but after converting to PDF the tables are split across pages.
Does anyone have an idea, or know a better library for converting docx to PDF?
PdfOptions options = PdfOptions.getDefault();
options.fontProvider((familyName, encoding, size, style, color) -> {
try {
BaseFont baseFont = BaseFont.createFont("fonts/times.ttf", encoding, BaseFont.EMBEDDED);
return new Font(baseFont, size, style, color);
} catch (Exception e) {
throw new RuntimeException(e);
}
});
PdfConverter.getInstance().convert(document, out, options);
There is no library to convert a doc[x] file into a completely correctly formatted PDF. The only program that can do that is Word itself.
I have achieved this by using the Word API in a PowerShell script:
$document_path = $args[0]
$document_parent_folder = $args[1]
$file_name = $args[2]
$word_app = New-Object -ComObject Word.Application
$document = $word_app.Documents.Open($document_path)
$pdf_filename = "$($document_parent_folder)\$($file_name)"
$document.SaveAs([ref] $pdf_filename, [ref] 17)
$document.Close()
$word_app.Quit()
Yes, it is not the best solution: it depends heavily on having Microsoft Office installed on the machine, and a lot of other problems accompany it. But it is the only solution that formatted my documents exactly how I wanted them.
The script takes three arguments:
(1) The path of the document that will be converted
(2) The folder where it is located
(3) The name of the PDF file
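Since the question is about a Java application, the script could be launched from Java with ProcessBuilder. This is only a sketch under assumptions: the script above is saved as convert-to-pdf.ps1 (a hypothetical name), powershell.exe is on the PATH, and all paths in main are made up for illustration.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

class WordToPdfLauncher {
    /** Builds the command line; the last three entries become $args[0..2] in the script. */
    static List<String> buildCommand(String scriptPath, String documentPath,
                                     String parentFolder, String pdfFileName) {
        return Arrays.asList(
                "powershell.exe", "-NoProfile", "-ExecutionPolicy", "Bypass",
                "-File", scriptPath,
                documentPath, parentFolder, pdfFileName);
    }

    /** Runs the script, waits for it to finish, and returns its exit code. */
    static int convert(String scriptPath, String documentPath,
                       String parentFolder, String pdfFileName)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(
                buildCommand(scriptPath, documentPath, parentFolder, pdfFileName))
                .inheritIO() // show the script's output and errors in the Java console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical script name and paths, for illustration only.
        int exitCode = convert("convert-to-pdf.ps1",
                "C:\\docs\\report.docx", "C:\\docs", "report.pdf");
        System.out.println("PowerShell exited with code " + exitCode);
    }
}
```

Keeping the command construction in its own method makes it possible to check the argument order without starting Word.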

Chinese Characters Are Garbled in I/O Operations

I am trying to write Chinese characters to a file, but I get the wrong result.
For instance:
import java.io.*;
import java.nio.*;
class x {
public static void main(String... args) throws Exception {
OutputStreamWriter outputStreamWriter =
new OutputStreamWriter(new FileOutputStream(new File("practice.csv"), true), "GBK");
outputStreamWriter.write("常用场景");
outputStreamWriter.write("Helo World!");
outputStreamWriter.flush();
outputStreamWriter.close();
}
}
Response : ????¡±¡§??????Helo World!
I tried changing the charset to UTF-8 and UTF-16, but it made no difference. Finally I tried adding a BufferedWriter, but that did not help either.
Then I tried changing the file from csv to txt, with the same result. What am I doing wrong?
I finally found it. First, many thanks to @Kayaman and @user16320675 for helping.
In fact, the code was correct all along. The source of the problem is that the csv file was being opened in Excel. When you open a csv file directly in Excel, it is decoded according to the encoding of the computer's current language. I am on an English Windows 10, where Excel uses ANSI; the only other option there is to import the data manually (Data Import).
My solution: I added the Chinese language pack to my Windows 10 computer and changed the Excel editing language (Chinese as the default), and everything worked.
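The diagnosis can be reproduced without Excel by decoding the same bytes with two different charsets. The sketch below also shows an alternative to changing the system language: writing the file as UTF-8 with a leading byte order mark, which (as an assumption not tested in this thread) Excel is generally reported to detect when opening a csv directly. The class name and the output file name are hypothetical.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class CsvEncodingDemo {
    /** Encodes text with one charset and decodes the resulting bytes with another. */
    static String reinterpret(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    /** Writes text as UTF-8, preceded by a byte order mark (U+FEFF). */
    static void writeUtf8WithBom(Path file, String text) throws IOException {
        try (Writer writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            writer.write('\uFEFF'); // encoded as the bytes EF BB BF
            writer.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        String text = "常用场景";
        Charset gbk = Charset.forName("GBK");
        Charset ansi = Charset.forName("windows-1252"); // Western "ANSI" code page

        // GBK bytes read back as GBK: the text survives.
        System.out.println(reinterpret(text, gbk, gbk).equals(text));  // prints true
        // The same bytes read as windows-1252: the text is garbled.
        System.out.println(reinterpret(text, gbk, ansi).equals(text)); // prints false

        // Hypothetical file name.
        writeUtf8WithBom(Paths.get("practice-utf8.csv"), text + ",Hello World!\n");
    }
}
```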

Creating Word File With JTextPane Style Option

I want to save the contents of a JTextPane to a word file.
I don't have a problem saving but I can't currently keep some style options such as paragraph styles.
I use these libraries:
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
Code:
System.out.println("Kaydete basıldı"); // Turkish: "Save was pressed"
String text = textPane.getText();
lblNewLabel.setText(text);
XWPFDocument document = new XWPFDocument();
XWPFParagraph paragraph = document.createParagraph();
XWPFRun run = paragraph.createRun();
run.setText(text);
try {
FileOutputStream dosyaCikis = new FileOutputStream(
"sercan.docx");
document.write(dosyaCikis);
dosyaCikis.close();
} catch (Exception e2) {
e2.printStackTrace();
}
Apache POI or any other approach is fine. Any help is appreciated.
This example shows how to set various style options (Apache POI):
SimpleDocument
Example code from the link:
XWPFDocument doc = new XWPFDocument();
XWPFParagraph p1 = doc.createParagraph();
p1.setAlignment(ParagraphAlignment.CENTER);
p1.setBorderBottom(Borders.DOUBLE);
p1.setBorderTop(Borders.DOUBLE);
p1.setBorderRight(Borders.DOUBLE);
p1.setBorderLeft(Borders.DOUBLE);
p1.setBorderBetween(Borders.SINGLE);
p1.setVerticalAlignment(TextAlignment.TOP);
XWPFRun r1 = p1.createRun();
r1.setBold(true);
r1.setText("The quick brown fox");
r1.setBold(true);
r1.setFontFamily("Courier");
r1.setUnderline(UnderlinePatterns.DOT_DOT_DASH);
r1.setTextPosition(100);
Other examples (styles, images, etc.) can be found here:
Example Package
AFAIK the options for writing Word files are limited from standard Java libraries.
You probably want to use a tool that explicitly supports Word formats - the best bet is probably LibreOffice, which is Free software. The LibreOffice API supports Java and other languages.
For a fuller explanation look here:
What's a good Java API for creating Word documents?
However that answer refers to OpenOffice, of which LibreOffice is a more actively developed fork due to management issues over the years.
You could try docx_editor_kit. From the web page:
it can open docx file and reflect the content in JEditorPane (or
JTextPane). Also user can create styled content and store the content
in docx format.
Somewhat related is my docx4all, but it hasn't been updated recently, and it may be overkill for your purposes.
Both of these use docx4j (as opposed to POI).

Adding fonts to Apache Pdfbox?

Is there a way to add additional font styles into Apache Pdfbox?
We're currently trying to work around printing PDFs in our system (currently being done with PDF-Renderer.) I have been looking at various alternatives (pdfbox, jpedal, jPDFPrint)
Our hope is for a free GPL compatible library to use, and as such we're leaning towards pdfbox. I have been able to write some sample code to print out the pdf which 'works'. See below:
PDDocument doc;
try {
doc = PDDocument.load("test.pdf");
doc.print();
} catch (Exception e) {
// Come up with better thing to do on fail.
e.printStackTrace();
}
As I mentioned, this works but the problem I'm running into is that PdfBox doesn't seem to be recognizing the fonts used in the pdf, and as such changes the font being used. As a result the document looks very odd (spacing and character size are different and look bizarre). I routinely see the following log message, or things like it:
Apr 16, 2014 2:56:21 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
WARNING: Changing font on < > from < NimbusMono > to the default font
Does anyone know of a way (or a reference) on how to approach adding a new fonttype into pdfbox? Or barring that, how to change the default font type?
From what I can tell, pdfbox supports 14 standard fonts. Unfortunately NimbusMono is not one of them. Any guidance would be appreciated.
The unreleased 2.0 version supports the rendering of embedded fonts. You can get it as a snapshot
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/
or through "svn checkout http://svn.apache.org/repos/asf/pdfbox/trunk/". The API is slightly different from the 1.8.x versions and might change, the best is to look at the code examples. A quick test to see whether your file will be rendered properly is to download the "pdfbox-app"
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.0-SNAPSHOT/
and then run the viewer:
java -jar pdfbox-app-2.0.0-20140416.173452-273.jar PDFReader your-file-name.pdf
There's also a print feature.
Good luck!
Update 2016: 2.0 release is out, download it here.
If you have used the 1.8 version, read the migration guide.
I came across this post while trying to solve the same problem. The PDFBox 2.0 API documentation isn't great at the moment.
What you're looking for is the FontFileFinder in Fontbox.
Make sure you're using the full pdfbox-app jar which includes Fontbox.
I've only tried this on Windows but looking at the classes it seems like it supports the other main operating systems.
Here's a simple example class I wrote that writes out a small bit of text in the bottom left corner of a PDF, using a non-standard font.
import java.io.File;
import java.io.IOException;
import java.net.URI;
import java.util.List;
import org.apache.fontbox.util.autodetect.FontFileFinder;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType0Font;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
public class TestPDFWrite {
public static void main(String[] args) throws IOException {
FontFileFinder fontFinder = new FontFileFinder();
List<URI> fontURIs = fontFinder.find();
File fontFile = null;
for (URI uri : fontURIs) {
File font = new File(uri);
if (font.getName().equals("CHILLER.TTF")) {
fontFile = font;
}
}
PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage(page);
PDPageContentStream contentStream = new PDPageContentStream(document, page);
contentStream.beginText();
if (fontFile != null) {
contentStream.setFont(PDType0Font.load(document, fontFile), 12);
} else {
contentStream.setFont(PDType1Font.HELVETICA, 12);
}
contentStream.newLineAtOffset(10, 10);
contentStream.showText("Hello World");
contentStream.endText();
contentStream.close();
document.save("C:/Hello World.pdf");
document.close();
}
}
I ran into a similar problem with PDFBox. PDFs can be printed in a straightforward way using Java's javax.print package. The following code is slightly modified from the API docs for javax.print.
DocFlavor flavor = DocFlavor.INPUT_STREAM.PDF;
PrintRequestAttributeSet aset = new HashPrintRequestAttributeSet();
aset.add(MediaSizeName.NA_LETTER); // letter size
PrintService[] pservices = PrintServiceLookup.lookupPrintServices(flavor, aset);
if (pservices.length > 0) {
    DocPrintJob pj = pservices[0].createPrintJob();
    try {
        FileInputStream fis = new FileInputStream("test.pdf");
        Doc doc = new SimpleDoc(fis, flavor, null);
        pj.print(doc, aset);
    } catch (FileNotFoundException | PrintException e) {
        e.printStackTrace(); // report the failure rather than ignoring it
    }
}
This code assumes that the printer can accept a PDF directly but it allows you to bypass PDFBox 1.8 branch's wonky font issues.
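Whether that assumption holds on a given machine can be checked up front by asking javax.print which services report support for the PDF input-stream flavor. A minimal sketch (the class name is hypothetical); an empty result suggests the javax.print route will not work there.

```java
import java.util.ArrayList;
import java.util.List;
import javax.print.DocFlavor;
import javax.print.PrintService;
import javax.print.PrintServiceLookup;

class PdfPrinterCheck {
    /** Returns the names of print services that report support for PDF input streams. */
    static List<String> pdfCapablePrinters() {
        // A null attribute set means "no further constraints".
        PrintService[] services =
                PrintServiceLookup.lookupPrintServices(DocFlavor.INPUT_STREAM.PDF, null);
        List<String> names = new ArrayList<>();
        for (PrintService service : services) {
            names.add(service.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> printers = pdfCapablePrinters();
        if (printers.isEmpty()) {
            System.out.println("No print service reports direct PDF support.");
        } else {
            printers.forEach(System.out::println);
        }
    }
}
```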

Viewing .doc file with java applet

I have a web application. I generate an MS Word document in XML format (Word 2003 XML Document) on the server side. I need to show this document to the user on the client side using some kind of viewer. So the question is: what libraries can I use to solve this problem? I need an API to view a Word document on the client side using Java.
You cannot reliably display a Word document in a web page using Java (or any other simple technology for that matter). There are several commercial libraries out there to render Word, but you will not find these to be easy, cheap or reliable solutions.
What you should do is the following:
(1) Open the Word engine on the server using a .NET program
(2) Convert the document to Rich Text using the Word engine
(3) Display the rich text either using the RTF Swing widget, or convert to HTML:
String rtf = [your document rich text];
BufferedReader input = new BufferedReader(new StringReader(rtf));
RTFEditorKit rtfKit = new RTFEditorKit();
StyledDocument doc = (StyledDocument) rtfKit.createDefaultDocument();
rtfKit.read( input, doc, 0 );
input.close();
HTMLEditorKit htmlKit = new HTMLEditorKit();
StringWriter output = new StringWriter();
htmlKit.write( output, doc, 0, doc.getLength());
String html = output.toString();
The main risk in this approach is that the Word engine will either crash or have a memory leak. For this reason you have to have a mechanism for restarting it periodically and testing it to make sure it is functional and not hogging memory.
docx4all is a Swing-based applet which does Word 2007 XML (ie not Word 2003 XML), which we wrote several years ago.
Get it from svn.
That's a possible approach for editing. If all you want is a viewer, why not convert to HTML or PDF? You can use docx4j for that. (Disclosure: "my" project).
You might have a look at the Apache POI - Java API to Handle Microsoft Word Files which is able to read all kinds of word documents (OLE2 and OOXML formats, .doc and .docx extensions respectively).
Reading a doc file can be as easy as:
import java.io.File;
import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
public class ReadDocFile {
    public static void main(String[] args) {
        File file = new File("c:\\New.doc");
        // try-with-resources closes the input stream when done
        try (FileInputStream fis = new FileInputStream(file)) {
            HWPFDocument document = new HWPFDocument(fis);
            WordExtractor extractor = new WordExtractor(document);
            // One array entry per paragraph
            for (String paragraph : extractor.getParagraphText()) {
                if (paragraph != null) {
                    System.out.println(paragraph);
                }
            }
        } catch (Exception e) {
            // Report the failure instead of silently swallowing it
            e.printStackTrace();
        }
    }
}
You can find more at: HWPF Quick-Guide (specifically HWPF unit tests)
Note that, according to the POI site:
HWPF is still in early development.
I'd suggest looking at the openoffice source code and implement that.
It's supposed to be written in java.
