PDFBox does not correctly render Simsun (chinese) font - java

Context
I am writing a Java code which fill PDF Forms using PDFBox with some user inputs.
Some of the inputs are in Chinese.
When I generated the PDF, I don't have any errors in the logs but the rendered text is absolutely not the same.
What I currently have
Here is what I do:
In the PDF file, I specified the SimSun font for the field using Adobe Pro.
This font handle Simplified Chinese characters.
I have the font SimSun installed on my server.
PDFBox doesn't display any error (if I remove the SimSun font from my server then PDFBox fallback on another font that is not able to render the characters). So i guess it is able to find the font and use it.
What I tried
I was able to make this work but I had to manually load the font in the code and add it to the PDF (see examples below).
But that is not a solution as it means that I would have to load the font every time and add it the the PDF. I would also have to do the same for many other languages.
As far as I understood, PDFBox should be able to use any fonts installed on the server.
Below is a test class that tries 3 different approaches. Only the last one works so far:
Classic generation
Simply put Chinese characters inside the text field without changing anything.
The characters are not rendered correctly (some of them are missing and the ones displayed does not match the input).
Generation with embedded font
Try to embed the SimSun font inside the PDF with the PDResource.add(font) method.
The result is the same as the first method.
Embed the font and use it
I embed the SimSun font and I also override the font used in the TextField to use the SimSun font I just added.
This approach works.
After quite a few readings, I found out that the issue might come from the version of the font I am using.
Windows 8 (which I use to create the form) uses v5.04 of Simsun font.
I use v2.10 on my laptop and my servers, both being Linux based (I can not find the v5.04).
However, I don't know:
If the issue is really coming from this.
If I have the right to use this font, as it is developed by Microsoft (and Apple).
Where to find the latest version of it.
I tried using another font but:
I only find OTF fonts (and not TTF) that support Chinese characters.
PDFBox does not support OTF (yet). It is planed for v3.0.0.
So if someone has an idea on how to make this work without having to embed and change the font's name in the code, that would be great!
Here are the PDF I used and the code that tests the 3 methods I talked about.
The TextField in the pdf is named comment.
package org.test;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSString;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType0Font;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
/**
* Hello world!
*/
public class App {
private static final String SIMPLIFIED_CHINESE_STRING = "我不明白为什么它不起作用。";
public static void main(String[] args) throws IOException {
System.out.println("Hello World!");
// Test 1
classicGeneration();
// Test 2
generationWithEmbededFont();
Test 3
generationWithFontOverride();
System.out.println("Bye!");
}
/**
* Classic PDF generation without any changes to the PDF.
*/
private static void classicGeneration() throws IOException {
PDDocument document = loadPdf();
PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
PDField commentField = acroForm.getField("comment");
commentField.setValue(SIMPLIFIED_CHINESE_STRING);
document.save(new File("result-classic-generation.pdf"));
}
/**
* Trying to embed the font in the PDF. It doesn't seem to work.
* The result is the same as classicGeneration method.
*/
private static void generationWithEmbededFont() throws IOException {
PDDocument document = loadPdf();
PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
PDFont font = PDType0Font.load(document, new File("/usr/share/fonts/SimSun.ttf"));
PDResources res = acroForm.getDefaultResources();
if (res == null) {
res = new PDResources();
}
COSName fontName = res.add(font);
acroForm.setDefaultResources(res);
PDField commentField = acroForm.getField("comment");
commentField.setValue(SIMPLIFIED_CHINESE_STRING);
document.save(new File("result-with-embeded-font.pdf"));
}
/**
* Embed the font in the PDF and change the font used in the TextField to use this one.
* Here the PDF is correctly rendered and all the characters are displayed.
* #throws IOException
*/
private static void generationWithFontOverride() throws IOException {
PDDocument document = loadPdf();
PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
PDField commentField = acroForm.getField("comment");
// Load the font
InputStream resourceAsStream = Thread.currentThread().getContextClassLoader().getResourceAsStream("SimSun.ttf");
PDFont font = PDType0Font.load(document, resourceAsStream);
PDResources res = acroForm.getDefaultResources();
if (res == null) {
res = new PDResources();
}
COSName fontName = res.add(font);
acroForm.setDefaultResources(res);
// Change the font used by the TextField
COSDictionary dict = commentField.getCOSObject();
COSString defaultAppearance = (COSString) dict.getDictionaryObject(COSName.DA);
if (defaultAppearance != null) {
String currentFont = dict.getString(COSName.DA);
// Retrieve the current font size and color used for the field in order to use the same but with the new font.
String regex = "[\\w]* ([\\w\\s]*)";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(currentFont);
// Default font size if we fail to extract the current one
String fontSize = " 11 Tf";
if (matcher.find()) {
fontSize = " " + matcher.group(1);
}
// Change the font of the TextField.
dict.setString(COSName.DA, "/" + fontName.getName() + fontSize);
}
commentField.getCOSObject().addAll(dict);
commentField.setValue(SIMPLIFIED_CHINESE_STRING);
document.save(new File("result-with-font-override.pdf"));
}
// HELPER
private static PDDocument loadPdf() throws IOException {
InputStream stream = Thread.currentThread().getContextClassLoader().getResourceAsStream("sample.pdf");
return PDDocument.load(stream);
}
}

Related

How to use TTF font with PDFBox AcroForm and then flatten document?

I have been trying to make a fillable PDF file with LibreOffice Writer 7.2.2.2. Here is how the document looks like:
All fields right of the vertical lines are form textboxes, each one having its own name(tbxOrderId, tbxFullName...). Each textbox uses SF Pro Text Light as font. Only the one on the bottom right(tbxTotal) - Total €123.00 has Oswald Regular. The document looks alright when I fill these fields with LibreOffice Writer.
Below this are my export settings. I chose Archive PDF A-2b in order to embed the fonts into the document.
Here is the output when I run pdffonts to the exported PDF file.
However, when I run the following code which just changes the values of tbxOrderId and tbxTotal, the output PDF document is missing these fonts.
public class Start {
public static void main(String[] args) {
try {
PDDocument pDDocument = PDDocument.load(new File("/media/stoyank/Elements/Java/tmp/Receipt.pdf"));
PDAcroForm pDAcroForm = pDDocument.getDocumentCatalog().getAcroForm();
PDField field = pDAcroForm.getField("tbxOrderId");
field.setValue("192753");
field = pDAcroForm.getField("tbxTotal");
field.setValue("Total: €192.00");
pDAcroForm.flatten();
pDDocument.save("/media/stoyank/Elements/Java/tmp/output.pdf");
pDDocument.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
This is how the output document looks like:
I tried to add the font manually by referring to this Stackoverflow question, but still no success:
PDDocument pDDocument = PDDocument.load(new File("/media/stoyank/Elements/Java/tmp/Receipt.pdf"));
PDAcroForm pDAcroForm = pDDocument.getDocumentCatalog().getAcroForm();
InputStream font_file = ClassLoader.getSystemResourceAsStream("Oswald-Regular.ttf");
PDType0Font font = PDType0Font.load(pDDocument, font_file, false);
if (font_file != null) font_file.close();
PDResources resources = pDAcroForm.getDefaultResources();
if (resources == null) resources = new PDResources();
resources.put(COSName.getPDFName("Oswald-Regular"), font);
pDAcroForm.setDefaultResources(resources);
pDAcroForm.refreshAppearances();
PDField field = pDAcroForm.getField("tbxOrderId");
field.setValue("192753");
field = pDAcroForm.getField("tbxTotal");
field.setValue("Total: €192.00");
pDAcroForm.flatten();
pDDocument.save("/media/stoyank/Elements/Java/tmp/output.pdf");
pDDocument.close();
After I write into these textbox fields, I want to flatten the document.
Here is my folder structure:
System: Ubuntu 20.04
Also, here is a link to the ODT file that I then export to a PDF and the exported PDF.
Your file doesn't have correct appearance streams for the fields, this is a bug from the software that created the PDF. Call pDAcroForm.refreshAppearances(); as early as possible.
The code in pastebin is fine (it is based on CreateSimpleFormWithEmbeddedFont.java example), except that you should keep the default resources and not start with empty resources. So your code should look like this:
pDAcroForm.refreshAppearances();
PDType0Font formFont = PDType0Font.load(pDDocument, ...input stream..., false);
PDResources resources = pDAcroForm.getDefaultResources();
if (resources == null)
{
resources = new PDResources();
pDAcroForm.setDefaultResources(resources);
}
final String fontName = resources.add(formFont).getName();
// Acrobat sets the font size on the form level to be
// auto sized as default. This is done by setting the font size to '0'
String defaultAppearanceString = "/" + fontName + " 0 Tf 0 g";
PDTextField field = (PDTextField) (pDAcroForm.getField("tbxTotal"));
field.setDefaultAppearance(defaultAppearanceString);
field.setValue("Total: €192.00");

Annotation content not appearing with PDFBox

Update: This is working in adobe reader, but not in the osx default pdf reader. Many of our users use the default osx reader so ideally I could get it working there, I know it supports annotations)
I am using Apache PDFBox 2.0.22 to try and add annotations to a pdf programmatically. The code I have runs, and produces a pdf with an annotation, but the content's text of the annotation is empty (See screenshot). What am I doing wrong?
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.graphics.color.PDColor;
import org.apache.pdfbox.pdmodel.graphics.color.PDDeviceRGB;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationTextMarkup;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.List;
public class Sample{
public static void main(String[] args) throws FileNotFoundException, IOException{
PDDocument doc = PDDocument.load(new File("test.pdf"));
try {
//insert new page
PDPage page = (PDPage) doc.getDocumentCatalog().getPages().get(0);
List<PDAnnotation> annotations = page.getAnnotations();
//generate instanse for annotation
PDAnnotationTextMarkup txtMark = new PDAnnotationTextMarkup(PDAnnotationTextMarkup.SUB_TYPE_HIGHLIGHT);
//set the rectangle
PDRectangle position = new PDRectangle();
position.setLowerLeftX(170);
position.setLowerLeftY(125);
position.setUpperRightX(195);
position.setUpperRightY(140);
txtMark.setRectangle(position);
//set the quadpoint
float[] quads = new float[8];
//x1,y1
quads[0] = position.getLowerLeftX();
quads[1] = position.getUpperRightY() - 2;
//x2,y2
quads[2] = position.getUpperRightX();
quads[3] = quads[1];
//x3,y3
quads[4] = quads[0];
quads[5] = position.getLowerLeftY() - 2;
//x4,y4
quads[6] = quads[2];
quads[7] = quads[5];
txtMark.setQuadPoints(quads);
txtMark.setAnnotationName("My annotation");
txtMark.setTitlePopup("title popup");
txtMark.setContents("Highlighted since it's important");
txtMark.setRichContents("Here is some rich content");
PDColor blue = new PDColor(new float[] { 0, 0, 1 }, PDDeviceRGB.INSTANCE);
txtMark.setColor(blue);
annotations.add(txtMark);
page.setAnnotations(annotations);
doc.save("test-out.pdf");
}finally
{
doc.close();
}
}
}
just adding this as an answer to have a better formatting.
PDFBOX is constantly changing, and maybe .constructAppearances() works better at some point - but meanwhile, ... I myself followed the advice to create my own appearance stream for the annotation. But it still did not appear to work. I had tried everything, it seemed - and in every case sometimes the annotations would appear in Acrobat Reader DC, but they would not show in Chrome or Firefox default PDF viewer.
First of all - if you setContent() on the annotation object itself, AND if you don't supply your own appearance stream - the viewers like Acrobat Reader DC will sometimes try to construct their own version of appearance - which varies across viewers - now if you supply your own, .. some viewers will still construct their own.. e.g. interactive elements like a small yellow icon that indicates that we have an annotation here and which shows the contents if you hover the mouse cursor over it - that's why, I guess, some of the programmers will try to not call setContent() (?)
Moreover, if you, like me, choose to use PDAnnotationTextMarkup with the SUB_TYPE_FREETEXT subtype - to just write something over the existing PDF, Acrobat Reader DC specifically will attempt to create its own visual something - it seems that it tries to modify (to correct?) the contents of the document, producing and showing a slightly changed version, which will result in existing digital signatures changing state - we can overcome this problem according to this SO q&a.
But it seems that there is a better way to write a static text over the PDF which Acrobat Reader DC will not try to modify, and it's the Rubber Stamp annotation. You can plainly substitute PDAnnotationTextMarkup (and PDFBOX 3.0.0 has another name for that..) with PDAnnotationRubberStamp. Everything else stays the same.
So what I was doing wrong in all my annotations code was to misplace the bounding box for the annotation's appearance. setBBox function takes as a parameter a PDRectangle, and PDAnnotation.getRectangle() returns a PDRectangle - but the first cannot use the second as is! Because the second PDRectangle coordinates are in relation to the page's lower left corner, while we need a rectangle that has 0, 0 as X, Y.
So the code that I came up with is as follows (and I haven't yet starting with _font and the font embedding in general which seems like a huge topic..):
private void addAnnotation(String name, PDDocument doc, PDPage page, float x, float y, String text) throws IOException {
List<PDAnnotation> annotations = page.getAnnotations();
PDAnnotationRubberStamp t = new PDAnnotationRubberStamp();
t.setAnnotationName(name); // might play important role
t.setPrinted(true); // always visible
t.setReadOnly(true); // does not interact with user
t.setContents(text);
// calculate realWidth, realHeight according to font size (e.g. using _font.getStringWidth(text))
float realWidth = 100, realHeight = 100;
PDRectangle rect = new PDRectangle(x, y, realWidth, realHeight);
t.setRectangle(rect);
PDAppearanceDictionary ap = new PDAppearanceDictionary();
ap.setNormalAppearance(createAppearanceStream(doc, t));
t.setAppearance(ap);
annotations.add(t);
page.setAnnotations(annotations);
// these must be set for incremental save to work properly (PDFBOX < 3.0.0 at least?)
ap.getCOSObject().setNeedToBeUpdated(true);
t.getCOSObject().setNeedToBeUpdated(true);
page.getResources().getCOSObject().setNeedToBeUpdated(true);
page.getCOSObject().setNeedToBeUpdated(true);
doc.getDocumentCatalog().getPages().getCOSObject().setNeedToBeUpdated(true);
doc.getDocumentCatalog().getCOSObject().setNeedToBeUpdated(true);
}
private void modifyAppearanceStream(PDAppearanceStream aps, PDAnnotation ann) throws IOException {
PDAppearanceContentStream apsContent = null;
try {
PDRectangle rect = ann.getRectangle();
rect = new PDRectangle(0, 0, rect.getWidth(), rect.getHeight()); // need to be relative - this is mega important because otherwise it appears as if nothing is printed
aps.setBBox(rect); // set bounding box to the dimensions of the annotation itself
// embed our unicode font (NB: yes, this needs to be done otherwise aps.getResources() == null which will cause NPE later during setFont)
PDResources res = new PDResources();
_fontName = res.add(_font).getName(); // okay I create _font elsewhere
aps.setResources(res);
// draw directly on the XObject's content stream
apsContent = new PDAppearanceContentStream(aps);
apsContent.beginText();
apsContent.setFont(PDType1Font.HELVETICA_BOLD, _fontSize); // _font
apsContent.setTextMatrix(Matrix.getTranslateInstance(0, 1));
apsContent.showText(ann.getContents());
apsContent.endText();
}
finally {
if (apsContent != null) {
try { apsContent.close(); } catch (Exception ex) { log.error(ex.getMessage(), ex); }
}
}
aps.getResources().getCOSObject().setNeedToBeUpdated(true);
aps.getCOSObject().setNeedToBeUpdated(true);
}
private PDAppearanceStream createAppearanceStream(final PDDocument document, PDAnnotation ann) throws IOException
{
PDAppearanceStream aps = new PDAppearanceStream(document);
modifyAppearanceStream(aps, ann);
return aps;
}

Unable to save Arabic words in a PDF - PDFBox Java

Trying to save Arabic words in an editable PDF. It works all fine with English ones but when I use Arabic words, I am getting this exception:
java.lang.IllegalArgumentException:
U+0627 is not available in this font Helvetica encoding: WinAnsiEncoding
Here is how I generated PDF:
public static void main(String[] args) throws IOException
{
String formTemplate = "myFormPdf.pdf";
try (PDDocument pdfDocument = PDDocument.load(new File(formTemplate)))
{
PDAcroForm acroForm = pdfDocument.getDocumentCatalog().getAcroForm();
if (acroForm != null)
{
PDTextField field = (PDTextField) acroForm.getField( "sampleField" );
field.setValue("جملة");
}
pdfDocument.save("updatedPdf.pdf");
}
}
That's how I made it work, I hope it would help others. Just use the font that is supported by the language that you want to use in the PDF.
public static void main(String[] args) throws IOException
{
String formTemplate = "myFormPdf.pdf";
try (PDDocument pdfDocument = PDDocument.load(new File(formTemplate)))
{
PDAcroForm acroForm = pdfDocument.getDocumentCatalog().getAcroForm();
// you can read ttf from resources as well, this is just for testing
PDFont font = PDType0Font.load(pdfDocument,new File("/path/to/font.ttf"));
String fontName = acroForm.getDefaultResources().add(pdfont).getName();
if (acroForm != null)
{
PDTextField field = (PDTextField) acroForm.getField( "sampleField" );
field.setDefaultAppearance("/"+fontName +" 0 Tf 0 g");
field.setValue("جملة");
}
pdfDocument.save("updatedPdf.pdf");
}
}
Edited: Adding the comment of mkl
The font name and the font size are parameters of the Tf instruction, and the gray value 0 for black is the parameter for the g instruction. Parameters and instruction names must be appropriately separated.
You need a font which supports those Arabic symbols.
Once you've got a compatible font, you can load it using PDType0Font
final PDFont font = PDType0Font.load(...);
A Type 0 font is a font which references multiple other fonts' formats, and can, potentially, load all available symbols.
See also the Cookbook - working with fonts (no examples with Type 0, but still useful).

How to change font of an embedded resource using PDFBox

I'm trying to get rid of a custom font that has been used for years. Due to regulations I need to replace this font with a common one.
Anyways, I've tried to write a JUnit Test to change the font of a pdf using PDFBox.
This is what I have done:
#Test
public void changeFontOfAllPdfsToArial() throws Exception {
PDDocument document = PDDocument.load(new File("src/test/broken_pdf.pdf"));
for(PDPage page : document.getPages()) {
PDResources resources = page.getResources();
for(COSName key : resources.getFontNames()) {
PDFont font = resources.getFont(key);
System.out.println(font.getFontDescriptor().getFontName());
if(resources.getFont(key).toString().contains("CUSTOM")) {
}
}
}
document.save(new File(PDFs.get(0).getAbsolutePath() + "_test"));
}
Iterating through the list gives me all the fonts of the document.
I'm getting the COSName key of the resource, but how do I change the font of it? Thanks for your help!
€: Just to mention: The font is embedded.

extract image from image

Is it possible to extract an image from a jpeg, png or tiff file? NOT PDF! Suppose I have a file containing both text and images in jpeg format (so it's basically a picture); I want to be able to extract the image only programmatically (preferably using Java). If anyone knows useful libraries please let me know. I have already tried AspriseOCR and tesseract-ocr, they have been successful at extracting text only (obviously).
Thank you.
Try :
int startProintX = xxx;
int startProintY = xxx;
int endProintX = xxx;
int endProintY = xxx;
BufferedImage image = ImageIO.read(new File("D:/temp/test.jpg"));
BufferedImage out = image.getSubimage(startProintX, startProintY, endProintX, endProintY);
ImageIO.write(out, "jpg", new File("D:/temp/result.jpg"));
These point are region of image you want to extract.
Extract image from pdf file
I suggest to change your post tile. You can use pdfbox or iText api. The below example to extract the all of the image from pdf file.
There might be some resource for you. If there are a lot of image in pdf, may be occur java.lang.OutOfMemoryError.
Download pdfbox.xx.jar here.
import java.io.File;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import org.apache.pdfbox.PDFBox;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;
import org.jdom.Document;
public class ExtractImagesFromPDF {
public static void main(String[] args) throws Exception {
PDDocument document = PDDocument.load(new File("D:/temp/test.pdf"));
List pages = document.getDocumentCatalog().getAllPages();
Iterator iter = pages.iterator();
while(iter.hasNext()) {
PDPage page = (PDPage)iter.next();
PDResources resources = page.getResources();
Map images = resources.getImages();
if( images != null ) {
Iterator imageIter = images.keySet().iterator();
while(imageIter.hasNext()) {
String key = (String)imageIter.next();
System.out.println("Key : " + key);
PDXObjectImage image = (PDXObjectImage)images.get(key);
File file = new File("D:/temp/" + key + "." + image.getSuffix());
image.write2file(file);
}
}
}
}
}
Extract specific image from pdf file
To extract specific image, you have to know index of page and index of image of that page. Otherwise, you cannot extract.
The following example program extract first image of first page.
int targetPage = 0;
PDPage firstPage = (PDPage)document.getDocumentCatalog().getAllPages().get(targetPage);
PDResources resources = firstPage.getResources();
Map images = resources.getImages();
int targetImage = 0;
String imageKey = "Im" + targetImage;
PDXObjectImage image = (PDXObjectImage)images.get(imageKey);
File file = new File("D:/temp/" + imageKey + "." + image.getSuffix());
image.write2file(file);
If you are interested in an out-of-box product that could do this via black-box processing with minimal non-programming configuration (since you tried other products), then ABBYY FlexiCapture can do it. It can be configured to look for dynamic sizes of pictures/objects in loosely defined areas, or anywhere on the page, with full control over search logic. I used it once to extract lines of specific shape and thickness to separate chapters of a book, where each line indicated a new chapter, and could be anywhere on the page.

Categories