Unable to save Arabic words in a PDF - PDFBox Java

I am trying to save Arabic words in an editable PDF form. It works fine with English text, but when I use Arabic words I get this exception:
java.lang.IllegalArgumentException:
U+0627 is not available in this font Helvetica encoding: WinAnsiEncoding
Here is how I generate the PDF:
public static void main(String[] args) throws IOException
{
String formTemplate = "myFormPdf.pdf";
try (PDDocument pdfDocument = PDDocument.load(new File(formTemplate)))
{
PDAcroForm acroForm = pdfDocument.getDocumentCatalog().getAcroForm();
if (acroForm != null)
{
PDTextField field = (PDTextField) acroForm.getField( "sampleField" );
field.setValue("جملة");
}
pdfDocument.save("updatedPdf.pdf");
}
}

This is how I made it work; I hope it helps others. Use a font that supports the language you want to write into the PDF, register it in the form's default resources, and point the field's default appearance at it.
public static void main(String[] args) throws IOException
{
    String formTemplate = "myFormPdf.pdf";
    try (PDDocument pdfDocument = PDDocument.load(new File(formTemplate)))
    {
        PDAcroForm acroForm = pdfDocument.getDocumentCatalog().getAcroForm();
        if (acroForm != null)
        {
            // you can read the ttf from resources as well, this is just for testing
            PDFont font = PDType0Font.load(pdfDocument, new File("/path/to/font.ttf"));
            // register the font in the form's default resources and keep its resource name
            String fontName = acroForm.getDefaultResources().add(font).getName();
            PDTextField field = (PDTextField) acroForm.getField("sampleField");
            // font size 0 means auto-size; "0 g" sets the text color to black
            field.setDefaultAppearance("/" + fontName + " 0 Tf 0 g");
            field.setValue("جملة");
        }
        pdfDocument.save("updatedPdf.pdf");
    }
}
Edited: Adding the comment of mkl
The font name and the font size are parameters of the Tf instruction, and the gray value 0 for black is the parameter for the g instruction. Parameters and instruction names must be appropriately separated.
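To make the operand order in that comment concrete, here is a minimal sketch of a helper that assembles a default appearance string. The class name DefaultAppearanceStrings and its build method are my own illustration and are not part of PDFBox; the helper only formats the string that you would then pass to field.setDefaultAppearance(...).

```java
import java.math.BigDecimal;

// Hypothetical helper (not a PDFBox class): builds a default appearance (DA) string.
final class DefaultAppearanceStrings {

    private DefaultAppearanceStrings() {
    }

    // Operands come first, then the operator, all separated by whitespace:
    //   /<fontResourceName> <fontSize> Tf   selects the font and size (size 0 = auto-size)
    //   <gray> g                            sets the fill gray level (0 = black, 1 = white)
    static String build(String fontResourceName, double fontSize, double gray) {
        return "/" + fontResourceName + " " + number(fontSize) + " Tf " + number(gray) + " g";
    }

    // Formats 12.0 as "12" and 0.5 as "0.5" so the string stays compact.
    private static String number(double value) {
        return BigDecimal.valueOf(value).stripTrailingZeros().toPlainString();
    }
}
```

With the font registered as in the code above, DefaultAppearanceStrings.build(fontName, 0, 0) should give the same "/<fontName> 0 Tf 0 g" string that the answer builds by hand.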

You need a font which supports those Arabic symbols.
Once you've got a compatible font, you can load it using PDType0Font:
final PDFont font = PDType0Font.load(...);
A Type 0 font is a composite font: its glyphs come from a descendant CID-keyed font and are addressed with multi-byte character codes, so it is not limited to the 256 codes of a simple font encoding such as WinAnsiEncoding and can expose every glyph the font file contains.
See also the Cookbook - working with fonts (no examples with Type 0, but still useful).
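If you want to know up front whether a value will trip the WinAnsiEncoding exception, a rough pre-check is possible with the JDK alone. This is only a sketch under an assumption: it uses the JDK's windows-1252 charset as a stand-in for WinAnsiEncoding, which is close but not identical, and WinAnsiCheck / fitsWinAnsi are names I made up, not PDFBox API.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical pre-check (not a PDFBox API): can this text be written with a
// WinAnsiEncoding font such as the standard Helvetica?
final class WinAnsiCheck {

    private WinAnsiCheck() {
    }

    // Assumption: windows-1252 approximates WinAnsiEncoding. The two differ in a
    // few code points, so treat the result as a hint, not a guarantee.
    static boolean fitsWinAnsi(String text) {
        CharsetEncoder encoder = Charset.forName("windows-1252").newEncoder();
        return encoder.canEncode(text);
    }
}
```

When fitsWinAnsi returns false for the value you are about to set, that is the case where you would load a PDType0Font as described above instead of relying on the form's default Helvetica.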

Related

How to use TTF font with PDFBox AcroForm and then flatten document?

I have been trying to make a fillable PDF file with LibreOffice Writer 7.2.2.2. Here is what the document looks like:
All fields to the right of the vertical lines are form text boxes, each with its own name (tbxOrderId, tbxFullName, ...). Each text box uses SF Pro Text Light as its font; only the one on the bottom right (tbxTotal), showing Total €123.00, uses Oswald Regular. The document looks all right when I fill in these fields with LibreOffice Writer.
Below are my export settings. I chose Archive PDF/A-2b in order to embed the fonts into the document.
Here is the output when I run pdffonts on the exported PDF file.
However, when I run the following code which just changes the values of tbxOrderId and tbxTotal, the output PDF document is missing these fonts.
public class Start {
public static void main(String[] args) {
try {
PDDocument pDDocument = PDDocument.load(new File("/media/stoyank/Elements/Java/tmp/Receipt.pdf"));
PDAcroForm pDAcroForm = pDDocument.getDocumentCatalog().getAcroForm();
PDField field = pDAcroForm.getField("tbxOrderId");
field.setValue("192753");
field = pDAcroForm.getField("tbxTotal");
field.setValue("Total: €192.00");
pDAcroForm.flatten();
pDDocument.save("/media/stoyank/Elements/Java/tmp/output.pdf");
pDDocument.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
This is what the output document looks like:
I tried to add the font manually by following this Stack Overflow question, but still had no success:
PDDocument pDDocument = PDDocument.load(new File("/media/stoyank/Elements/Java/tmp/Receipt.pdf"));
PDAcroForm pDAcroForm = pDDocument.getDocumentCatalog().getAcroForm();
InputStream font_file = ClassLoader.getSystemResourceAsStream("Oswald-Regular.ttf");
PDType0Font font = PDType0Font.load(pDDocument, font_file, false);
if (font_file != null) font_file.close();
PDResources resources = pDAcroForm.getDefaultResources();
if (resources == null) resources = new PDResources();
resources.put(COSName.getPDFName("Oswald-Regular"), font);
pDAcroForm.setDefaultResources(resources);
pDAcroForm.refreshAppearances();
PDField field = pDAcroForm.getField("tbxOrderId");
field.setValue("192753");
field = pDAcroForm.getField("tbxTotal");
field.setValue("Total: €192.00");
pDAcroForm.flatten();
pDDocument.save("/media/stoyank/Elements/Java/tmp/output.pdf");
pDDocument.close();
After I write into these textbox fields, I want to flatten the document.
Here is my folder structure:
System: Ubuntu 20.04
Also, here is a link to the ODT file that I then export to a PDF and the exported PDF.

Your file doesn't have correct appearance streams for the fields; this is a bug in the software that created the PDF. Call pDAcroForm.refreshAppearances(); as early as possible.
The code in pastebin is fine (it is based on CreateSimpleFormWithEmbeddedFont.java example), except that you should keep the default resources and not start with empty resources. So your code should look like this:
pDAcroForm.refreshAppearances();
PDType0Font formFont = PDType0Font.load(pDDocument, ...input stream..., false);
PDResources resources = pDAcroForm.getDefaultResources();
if (resources == null)
{
resources = new PDResources();
pDAcroForm.setDefaultResources(resources);
}
final String fontName = resources.add(formFont).getName();
// Acrobat sets the font size on the form level to be
// auto sized as default. This is done by setting the font size to '0'
String defaultAppearanceString = "/" + fontName + " 0 Tf 0 g";
PDTextField field = (PDTextField) (pDAcroForm.getField("tbxTotal"));
field.setDefaultAppearance(defaultAppearanceString);
field.setValue("Total: €192.00");

Printing Chinese characters in pdfbox

I'm using the following set-up:
Java 11.0.1
pdfbox 2.0.15
Objective: Rendering a pdf that contains Chinese characters
Problem: java.lang.IllegalArgumentException: U+674E is not available in this font's encoding: WinAnsiEncoding
I already tried:
Using different fonts for Chinese character support. The latest one is NotoSansCJKtc-Regular.ttf
Set font to unicode as described here: Java: Write national characters to PDF using PDFBox, however the used loadTTF method is deprecated.
Using Arial-Unicode-MS_4302.ttf
My code looks like this (shortened a bit):
try (InputStream pdfIn = inputStream; PDDocument pdfDocument =
PDDocument.load(pdfIn)) {
PDFont formFont;
//Check if Chinese characters are present
if (!Util.containsHanScript(queryString)) {
formFont = PDType0Font.load(pdfDocument,
PdfReportGenerator.class.getResourceAsStream("LiberationSans-Regular.ttf"),
false);
} else {
formFont = PDType0Font.load(pdfDocument,
PdfReportGenerator.class.getResourceAsStream("NotoSansCJKtc-Regular.ttf"),
false);
}
List<PDField> fields = acroForm.getFields();
//Load fields into Map
Map<String, PDField> pdfFields = new HashMap<>();
for (PDField field : fields) {
String key = field.getPartialName();
pdfFields.put(key, field);
}
PDField currentField = pdfFields.get("someFieldID");
PDVariableText pdfield = (PDVariableText) currentField;
PDResources res = acroForm.getDefaultResources();
String fontName = res.add(formFont).getName();
String defaultAppearanceString = "/" + fontName + " 10 Tf 0 g";
pdfield.setDefaultAppearance(defaultAppearanceString);
pdfield.setValue("李柱");
acroForm.flatten(fields, true);
ByteArrayOutputStream pdfOut = new ByteArrayOutputStream();
pdfDocument.save(pdfOut);
}
Expected result: Chinese characters on pdf.
Actual result: java.lang.IllegalArgumentException: U+674E is not available in this font's encoding: WinAnsiEncoding
So my question is about how to best support rendering of Chinese characters with pdfbox. Any help is appreciated.
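The Util.containsHanScript helper used in the code above is not shown in the question. For readers who want to reproduce the font switch, here is one way such a check could look using only the JDK; the class name ScriptCheck is an assumption of mine and the original helper may well be implemented differently.

```java
// Hypothetical stand-in for the Util.containsHanScript call in the question.
final class ScriptCheck {

    private ScriptCheck() {
    }

    // Returns true if at least one code point in the text belongs to the Han script.
    // Iterating over code points (not chars) also handles supplementary characters.
    static boolean containsHanScript(String text) {
        return text.codePoints()
                .anyMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }
}
```

Note that this only decides which font file to load; the exception in the question is about which font ends up being used to encode the value, which is a separate matter.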

The following code works for me, it uses the file of PDFBOX-4629:
PDDocument doc = PDDocument.load(new URL("https://issues.apache.org/jira/secure/attachment/12977270/Report_Template_DE.pdf").openStream());
PDAcroForm acroForm = doc.getDocumentCatalog().getAcroForm();
PDVariableText field = (PDVariableText) acroForm.getField("search_query");
List<PDField> fields = acroForm.getFields();
PDFont font = PDType0Font.load(doc, new FileInputStream("c:/windows/fonts/arialuni.ttf"), false);
PDResources res = acroForm.getDefaultResources();
String fontName = res.add(font).getName();
String defaultAppearanceString = "/" + fontName + " 10 Tf 0 g";
field.setDefaultAppearance(defaultAppearanceString);
field.setValue("李柱");
acroForm.flatten(fields, true);
doc.save("saved.pdf");
doc.close();

PDFBox does not correctly render Simsun (chinese) font

Context
I am writing Java code that fills PDF forms using PDFBox with some user inputs.
Some of the inputs are in Chinese.
When I generate the PDF, I don't get any errors in the logs, but the rendered text does not match the input at all.
What I currently have
Here is what I do:
In the PDF file, I specified the SimSun font for the field using Adobe Pro.
This font handles Simplified Chinese characters.
I have the font SimSun installed on my server.
PDFBox doesn't display any error (if I remove the SimSun font from my server, PDFBox falls back on another font that is not able to render the characters). So I guess it is able to find the font and use it.
What I tried
I was able to make this work, but I had to manually load the font in the code and add it to the PDF (see examples below).
But that is not a solution, as it means that I would have to load the font every time and add it to the PDF. I would also have to do the same for many other languages.
As far as I understand, PDFBox should be able to use any font installed on the server.
Below is a test class that tries 3 different approaches. Only the last one works so far:
Classic generation
Simply put Chinese characters inside the text field without changing anything.
The characters are not rendered correctly (some of them are missing and the ones displayed do not match the input).
Generation with embedded font
Try to embed the SimSun font inside the PDF with the PDResources.add(font) method.
The result is the same as the first method.
Embed the font and use it
I embed the SimSun font and I also override the font used in the TextField to use the SimSun font I just added.
This approach works.
After quite a bit of reading, I found out that the issue might come from the version of the font I am using.
Windows 8 (which I use to create the form) uses v5.04 of the SimSun font.
I use v2.10 on my laptop and my servers, both being Linux-based (I cannot find v5.04).
However, I don't know:
If the issue is really coming from this.
If I have the right to use this font, as it is developed by Microsoft (and Apple).
Where to find the latest version of it.
I tried using another font, but:
I can only find OTF fonts (not TTF) that support Chinese characters.
PDFBox does not support OTF (yet). It is planned for v3.0.0.
So if someone has an idea on how to make this work without having to embed and change the font's name in the code, that would be great!
Here are the PDF I used and the code that tests the 3 methods I talked about.
The TextField in the pdf is named comment.
package org.test;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSString;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType0Font;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
/**
* Hello world!
*/
public class App {
private static final String SIMPLIFIED_CHINESE_STRING = "我不明白为什么它不起作用。";
public static void main(String[] args) throws IOException {
System.out.println("Hello World!");
// Test 1
classicGeneration();
// Test 2
generationWithEmbededFont();
// Test 3
generationWithFontOverride();
System.out.println("Bye!");
}
/**
* Classic PDF generation without any changes to the PDF.
*/
private static void classicGeneration() throws IOException {
PDDocument document = loadPdf();
PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
PDField commentField = acroForm.getField("comment");
commentField.setValue(SIMPLIFIED_CHINESE_STRING);
document.save(new File("result-classic-generation.pdf"));
}
/**
* Trying to embed the font in the PDF. It doesn't seem to work.
* The result is the same as classicGeneration method.
*/
private static void generationWithEmbededFont() throws IOException {
PDDocument document = loadPdf();
PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
PDFont font = PDType0Font.load(document, new File("/usr/share/fonts/SimSun.ttf"));
PDResources res = acroForm.getDefaultResources();
if (res == null) {
res = new PDResources();
}
COSName fontName = res.add(font);
acroForm.setDefaultResources(res);
PDField commentField = acroForm.getField("comment");
commentField.setValue(SIMPLIFIED_CHINESE_STRING);
document.save(new File("result-with-embeded-font.pdf"));
}
/**
* Embed the font in the PDF and change the font used in the TextField to use this one.
* Here the PDF is correctly rendered and all the characters are displayed.
* @throws IOException
*/
private static void generationWithFontOverride() throws IOException {
PDDocument document = loadPdf();
PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
PDField commentField = acroForm.getField("comment");
// Load the font
InputStream resourceAsStream = Thread.currentThread().getContextClassLoader().getResourceAsStream("SimSun.ttf");
PDFont font = PDType0Font.load(document, resourceAsStream);
PDResources res = acroForm.getDefaultResources();
if (res == null) {
res = new PDResources();
}
COSName fontName = res.add(font);
acroForm.setDefaultResources(res);
// Change the font used by the TextField
COSDictionary dict = commentField.getCOSObject();
COSString defaultAppearance = (COSString) dict.getDictionaryObject(COSName.DA);
if (defaultAppearance != null) {
String currentFont = dict.getString(COSName.DA);
// Retrieve the current font size and color used for the field in order to use the same but with the new font.
String regex = "[\\w]* ([\\w\\s]*)";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(currentFont);
// Default font size if we fail to extract the current one
String fontSize = " 11 Tf";
if (matcher.find()) {
fontSize = " " + matcher.group(1);
}
// Change the font of the TextField.
dict.setString(COSName.DA, "/" + fontName.getName() + fontSize);
}
commentField.getCOSObject().addAll(dict);
commentField.setValue(SIMPLIFIED_CHINESE_STRING);
document.save(new File("result-with-font-override.pdf"));
}
// HELPER
private static PDDocument loadPdf() throws IOException {
InputStream stream = Thread.currentThread().getContextClassLoader().getResourceAsStream("sample.pdf");
return PDDocument.load(stream);
}
}
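The font-override method above rewrites the field's /DA string with a fairly loose regex. As an alternative sketch, the helper below swaps only the font name in front of the Tf operator and leaves the size and any color operators untouched. DefaultAppearanceEditor and withFont are hypothetical names of mine, not PDFBox API, and the pattern assumes the usual "/Name size Tf" layout of a DA string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a PDFBox class): replaces the font name in a DA string.
final class DefaultAppearanceEditor {

    // Matches "/<name> <number> Tf"; group 1 keeps the size and the operator
    // so that only the font name is replaced.
    private static final Pattern FONT_SELECTION =
            Pattern.compile("/\\S+(\\s+[-+]?\\d*\\.?\\d+\\s+Tf\\b)");

    private DefaultAppearanceEditor() {
    }

    static String withFont(String defaultAppearance, String newFontName) {
        Matcher matcher = FONT_SELECTION.matcher(defaultAppearance);
        if (matcher.find()) {
            return matcher.replaceFirst("/" + Matcher.quoteReplacement(newFontName) + "$1");
        }
        // No font selection present: prepend one with auto-size (0) and keep the rest.
        return ("/" + newFontName + " 0 Tf " + defaultAppearance.trim()).trim();
    }
}
```

In the test class above, the result of withFont(dict.getString(COSName.DA), fontName.getName()) could be written back with dict.setString(COSName.DA, ...) in place of the manual size extraction.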

PDFBox not printing form fields

I'm trying to print a post-processed (filled) PDF template, which was created in LibreOffice and contains filled-out form fields.
The PDFBox SVN repository is nice and has a lot of examples of how to do so. Getting the PDF and its AcroForm is easy, and even editing and saving the modified PDF to disk works as expected. But this is not my goal. I want a PDF in which the fields are filled out and then removed, with only the text remaining.
I tried everything on Stack Overflow regarding PDFBox, from flattening the AcroForm to setting read-only properties on the fields and other meta info; I also installed the necessary fonts and much more. Every time I printed the PDF to a file, the text (edited and unedited) that was in a text field disappeared and the text fields were gone.
But then I tried to create a PDF from scratch with PDFBox, and printing works as expected. The text fields were in the generated template, and the printed PDF file contained the text I wanted, with the corresponding form fields removed.
So I used the PDF Debugger from PDFBox to analyse the structure of the PDF and noticed that, in the debugger's preview, my PDF exported from LibreOffice does not show the text in the text field. BUT in the tree structure the PDF annotation is clearly there (/DV and /V) and looks quite similar to the version created by PDFBox, which is working.
For testing I created a simple pdf with just one text field with name "test" and content "Foobar". Also the background and border color were changed to see if anything was successfully printed out.
PDDocument document = null;
try {
document = PDDocument.load(new File("<filepath>\\<filename>"));
} catch (final InvalidPasswordException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (final IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
PrintFields.createDummyPDF("<filepath>\\<filename>");
PrintFields.printFields(document); //debug output
//Getting pdf meta infos
final PDDocumentCatalog docCatalog = document.getDocumentCatalog();
final PDAcroForm acroForm = docCatalog.getAcroForm();
docCatalog.setAcroForm(acroForm);
//setting the appearance
final PDFont font = PDType1Font.HELVETICA;
final PDResources resources = new PDResources();
resources.put(COSName.getPDFName("Helv"), font);
acroForm.setDefaultResources(resources);
String defaultAppearanceString = "/Helv 0 Tf 0 g";
acroForm.setDefaultAppearance(defaultAppearanceString);
for(final PDField f : acroForm.getFields()) {
if(f instanceof PDTextField) {
defaultAppearanceString = "/Helv 12 Tf 0 0 1 rg";
final List<PDAnnotationWidget> widgets = ((PDTextField)f).getWidgets();
widgets.get(0).setAppearanceState(defaultAppearanceString);
}
}
for(final PDField f : acroForm.getFields()) {
f.setReadOnly(true);
}
// save modified pdf to file
document.save("<filepath>\\<filename>");
//print to file (to pdf)
if (job.printDialog()) {
try {
// Desktop.getDesktop().print();
job.print();
} catch (final PrinterException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
// copied from pdfbox examples
public static void createDummyPDF(final String path) throws IOException
{
// Create a new document with an empty page.
try (PDDocument document = new PDDocument())
{
final PDPage page = new PDPage(PDRectangle.A4);
document.addPage(page);
// Adobe Acrobat uses Helvetica as a default font and
// stores that under the name '/Helv' in the resources dictionary
final PDFont font = PDType1Font.HELVETICA;
final PDResources resources = new PDResources();
resources.put(COSName.getPDFName("Helv"), font);
// Add a new AcroForm and add that to the document
final PDAcroForm acroForm = new PDAcroForm(document);
document.getDocumentCatalog().setAcroForm(acroForm);
// Add and set the resources and default appearance at the form level
acroForm.setDefaultResources(resources);
// Acrobat sets the font size on the form level to be
// auto sized as default. This is done by setting the font size to '0'
String defaultAppearanceString = "/Helv 0 Tf 0 g";
acroForm.setDefaultAppearance(defaultAppearanceString);
// Add a form field to the form.
final PDTextField textBox = new PDTextField(acroForm);
textBox.setPartialName("SampleField");
// Acrobat sets the font size to 12 as default
// This is done by setting the font size to '12' on the
// field level.
// The text color is set to blue in this example.
// To use black, replace "0 0 1 rg" with "0 0 0 rg" or "0 g".
defaultAppearanceString = "/Helv 12 Tf 0 0 1 rg";
textBox.setDefaultAppearance(defaultAppearanceString);
// add the field to the acroform
acroForm.getFields().add(textBox);
// Specify the widget annotation associated with the field
final PDAnnotationWidget widget = textBox.getWidgets().get(0);
final PDRectangle rect = new PDRectangle(50, 750, 200, 50);
widget.setRectangle(rect);
widget.setPage(page);
// set green border and yellow background
// if you prefer defaults, just delete this code block
final PDAppearanceCharacteristicsDictionary fieldAppearance
= new PDAppearanceCharacteristicsDictionary(new COSDictionary());
fieldAppearance.setBorderColour(new PDColor(new float[]{0,1,0}, PDDeviceRGB.INSTANCE));
fieldAppearance.setBackground(new PDColor(new float[]{1,1,0}, PDDeviceRGB.INSTANCE));
widget.setAppearanceCharacteristics(fieldAppearance);
// make sure the widget annotation is visible on screen and paper
widget.setPrinted(true);
// Add the widget annotation to the page
page.getAnnotations().add(widget);
// set the field value
textBox.setValue("Sample field");
document.save(path);
}
}
//copied from pdfbox examples
public static void processFields(final List<PDField> fields, final PDResources resources) {
fields.stream().forEach(f -> {
f.setReadOnly(true);
final COSDictionary cosObject = f.getCOSObject();
final String value = cosObject.getString(COSName.DV) == null ?
cosObject.getString(COSName.V) : cosObject.getString(COSName.DV);
System.out.println("Setting " + f.getFullyQualifiedName() + ": " + value);
try {
f.setValue(value);
} catch (final IOException e) {
if (e.getMessage().matches("Could not find font: /.*")) {
final String fontName = e.getMessage().replaceAll("^[^/]*/", "");
System.out.println("Adding fallback font for: " + fontName);
resources.put(COSName.getPDFName(fontName), PDType1Font.HELVETICA);
try {
f.setValue(value);
} catch (final IOException e1) {
e1.printStackTrace();
}
} else {
e.printStackTrace();
}
}
if (f instanceof PDNonTerminalField) {
processFields(((PDNonTerminalField) f).getChildren(), resources);
}
});
}
I would expect the PDFs generated by document.save() and job.print() to look identical in the viewer, but they do not.
If I take the PDF generated by document.save() with read-only disabled, I can use a PDF viewer like FoxitReader to fill the form and print it again. This produces the right output. Using the job.print() version leads to the text in the (text) form field disappearing.
Has anyone a clue why this is the case?
I'm using PDFBox 2.0.13 (latest release) and LibreOffice 6.1.4.2.
Here are the referred files, and here you can download the debugger (jar file, runnable with java -jar ).

How to change font of an embedded resource using PDFBox

I'm trying to get rid of a custom font that has been used for years. Due to regulations I need to replace this font with a common one.
Anyways, I've tried to write a JUnit Test to change the font of a pdf using PDFBox.
This is what I have done:
@Test
public void changeFontOfAllPdfsToArial() throws Exception {
PDDocument document = PDDocument.load(new File("src/test/broken_pdf.pdf"));
for(PDPage page : document.getPages()) {
PDResources resources = page.getResources();
for(COSName key : resources.getFontNames()) {
PDFont font = resources.getFont(key);
System.out.println(font.getFontDescriptor().getFontName());
if(resources.getFont(key).toString().contains("CUSTOM")) {
}
}
}
document.save(new File(PDFs.get(0).getAbsolutePath() + "_test"));
}
Iterating through the list gives me all the fonts of the document.
I'm getting the COSName key of the resource, but how do I change the font of it? Thanks for your help!
Edit: Just to mention, the font is embedded.
