My goal is to transfer the text content of a PDF to a new PDF while preserving the font formatting (e.g. bold, italic, underlined).
I am trying to use the TextPosition list from the existing PDF and write a new PDF from it.
For each entry in the TextPosition list, I take its font and font size, set them on a contentStream, and write the text with contentStream.showText().
After 137 successful iterations, this error is thrown:
Exception in thread "main" java.lang.IllegalArgumentException: No glyph for U+00AD in font VVHOEY+FrutigerLT-BoldCn
at org.apache.pdfbox.pdmodel.font.PDType1CFont.encode(
at org.apache.pdfbox.pdmodel.font.PDFont.encode(
at org.apache.pdfbox.pdmodel.PDPageContentStream.showTextInternal(
at org.apache.pdfbox.pdmodel.PDPageContentStream.showText(
at haupt.PageTest.printPdf(
at haupt.MyTestPDF.main(
This is my code up to this step:
public void printPdf() throws IOException {
TextPosition tpInfo = null;
String pdfFileInText = null;
int charIDindex = 0;
int pageIndex = 0;
try (PDDocument pdfDocument = PDDocument.load(new File(srcFile))) {
if (!pdfDocument.isEncrypted()) {
MyPdfTextStripper myStripper = new MyPdfTextStripper();
var articlesByPage = myStripper.getCharactersByArticleByPage(pdfDocument);
String newFileString = (srcErledigt + "Test.pdf");
File input = new File(newFileString);
PDDocument document = new PDDocument();
// For Pages
for (Iterator<List<List<TextPosition>>> pageIterator = articlesByPage.iterator(); pageIterator.hasNext();) {
List<List<TextPosition>> pageList = pageIterator.next();
PDPage newPage = new PDPage();
PDPageContentStream contentStream = new PDPageContentStream(document, newPage);
// For Articles
for (Iterator<List<TextPosition>> articleIterator = pageList.iterator(); articleIterator.hasNext();) {
List<TextPosition> articleList = articleIterator.next();
// For Text
for (Iterator<TextPosition> tpIterator = articleList.iterator(); tpIterator.hasNext();) {
tpCharID = charIDindex;
tpInfo = tpIterator.next();
System.out.println(tpCharID + ". charID: " + tpInfo);
PDFont tpFont = tpInfo.getFont();
float tpFontSize = tpInfo.getFontSize();
pdfFileInText = tpInfo.toString();
contentStream.setFont(tpFont, tpFontSize);
contentStream.newLineAtOffset(50, 700);
} else {
System.out.println("pdf Encrypted");
public class MyPdfTextStripper extends PDFTextStripper {
public MyPdfTextStripper() throws IOException {
public List<List<TextPosition>> getCharactersByArticle() {
return super.getCharactersByArticle();
// Add Pages to CharactersByArticle List
public List<List<List<TextPosition>>> getCharactersByArticleByPage(PDDocument doc) throws IOException {
final int maxPageNr = doc.getNumberOfPages();
List<List<List<TextPosition>>> byPageList = new ArrayList<>(maxPageNr);
for (int pageNr = 1; pageNr <= maxPageNr; pageNr++) {
return byPageList;
Additional Info:
There are seven fonts in my document, all of which are set as subsets.
I need to write the Text given with the corresponding Font given.
All glyphs that should be written already exist in the original document, where I get my TextPositionList from.
All fonts are of subtype Type 1 or Type 0.
There is no AcroForm defined
Thanks in advance
Edit 30.08.2022:
Fixed the issue by manually replacing this particular Unicode character with a placeholder in the string before trying to write it.
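A minimal sketch of that workaround, assuming the text is available as a plain Java String before it reaches showText(). U+00AD is the soft hyphen; the names SoftHyphenFilter and replaceSoftHyphen and the "-" placeholder are illustrative choices, not part of PDFBox:

```java
class SoftHyphenFilter {
    /** U+00AD SOFT HYPHEN, the code point named in the exception. */
    static final char SOFT_HYPHEN = '\u00AD';

    /** Returns the text with every soft hyphen replaced by the placeholder. */
    static String replaceSoftHyphen(String text, String placeholder) {
        return text.replace(String.valueOf(SOFT_HYPHEN), placeholder);
    }

    public static void main(String[] args) {
        String raw = "Haupt\u00ADseite"; // sample input containing a soft hyphen
        System.out.println(replaceSoftHyphen(raw, "-")); // prints "Haupt-seite"
    }
}
```

The cleaned string is what would then be passed to contentStream.showText(). Passing an empty placeholder drops the soft hyphen entirely.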
Now I ran into this open ToDo:
public byte[] encode(int unicode)
// todo: we can use a known character collection CMap for a CIDFont
// and an Encoding for Type 1-equivalent
throw new UnsupportedOperationException();
Does anyone have suggestions or workarounds for this?
Edit 01.09.2022
I tried to replace occurrences of that font with an alternative font from the source file, but this causes another problem: a COSStream is "randomly" closed, so the new document cannot be saved after I write my text with a contentStream.
Using standard fonts like PDType1Font.HELVETICA instead does work, though.
I have HTML content stored as a raw string in my database and I would like to print it to a PDF with a custom page size, for example 10 cm wide and 7 cm high, rather than the standard A4 format.
Can someone give me an example, if this is possible?
ByteArrayOutputStream out = new ByteArrayOutputStream();
PDRectangle rec = new PDRectangle(recWidth, recHeight);
PDPage page = new PDPage(rec);
try (PDDocument document = new PDDocument()) {
PdfRendererBuilder builder = new PdfRendererBuilder();
String htmlContent = "<b>Hello world</b>" + content;
builder.withHtmlContent(htmlContent, "");
PdfBoxRenderer renderer = builder.buildPdfRenderer();
} catch (Exception e) {
return new ByteArrayInputStream(out.toByteArray());
This code generates two files for me, one small and one A4.
I also tried this:
try (PDDocument document = new PDDocument()) {
PdfRendererBuilder builder = new PdfRendererBuilder();
builder.useDefaultPageSize(210, 297, PdfRendererBuilder.PageSizeUnits.MM);
String htmlContent = "<b>content</b>";
builder.withHtmlContent(htmlContent, "");
PdfBoxRenderer renderer = builder.buildPdfRenderer();
} catch (Exception e) {
log.error(">>> The creation of PDF is invalid!");
But in this case the content is not shown. If I remove useDefaultPageSize, the content is shown.
I haven't tested this solution, but try initialising the builder object with your desired page size and document type, like below:
builder.useDefaultPageSize(210, 297, PdfRendererBuilder.PageSizeUnits.MM);
The library supports several PDF/A formats; the possible values are listed in the PdfAConformance enum.
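Note that the useDefaultPageSize call above passes 210 and 297, which is A4 in millimetres; for the 10 cm x 7 cm page from the question, the values would presumably be 100 and 70 instead. If the size is given to PDFBox's PDRectangle, as in the question's first attempt, the values are in PDF points (1/72 inch), so millimetres need converting first. A small sketch of that conversion; the class name PageUnits and the helper mmToPoints are made up for illustration:

```java
import java.util.Locale;

class PageUnits {
    static final float POINTS_PER_INCH = 72f;
    static final float MM_PER_INCH = 25.4f;

    /** Converts a length in millimetres to PDF points (1/72 inch). */
    static float mmToPoints(float mm) {
        return mm / MM_PER_INCH * POINTS_PER_INCH;
    }

    public static void main(String[] args) {
        // 10 cm x 7 cm, as asked for in the question
        float widthPt = mmToPoints(100f);
        float heightPt = mmToPoints(70f);
        System.out.printf(Locale.ROOT, "%.2f x %.2f pt%n", widthPt, heightPt);
    }
}
```

The two results are the kind of values one would pass as recWidth and recHeight to new PDRectangle(recWidth, recHeight).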
I'm trying to print a post-processed (filled) PDF template, which was created in LibreOffice and contains filled-out form fields.
The PDFBox svn is nice and has a lot of examples of how to do so. Getting the PDF and its AcroForm is easy, and even editing and saving the modified PDF to disk works as expected. But this is not my goal. I want a PDF in which the fields are filled out and then removed, with only the text remaining.
I tried everything on Stack Overflow regarding PDFBox, from flattening the AcroForm to setting read-only properties on the fields and other meta info; I installed the necessary fonts and much more. Every time I printed the PDF to a file, the text (edited and non-edited) that was in a text field disappeared and the text fields were gone.
But then I tried to create a PDF from scratch with PDFBox, and printing works as expected. The text fields were in the generated template and the printed PDF file contained the text I wanted, with the corresponding forms removed.
So I used the PDF Debugger from PDFBox to analyse the structure of the PDF and noticed that, in the debugger's preview, my PDF exported from LibreOffice does not show the text in the text field. But in the tree structure the PDF annotation is clearly there (/DV and /V) and looks quite similar to the PDFBox-created version, which works.
For testing I created a simple PDF with just one text field named "test" with the content "Foobar". I also changed the background and border colors to see whether anything was successfully printed.
PDDocument document = null;
try {
document = PDDocument.load(new File("<filepath>\\<filename>"));
} catch (final InvalidPasswordException e) {
// TODO Auto-generated catch block
} catch (final IOException e) {
// TODO Auto-generated catch block
PrintFields.printFields(document); //debug output
//Getting pdf meta infos
final PDDocumentCatalog docCatalog = document.getDocumentCatalog();
final PDAcroForm acroForm = docCatalog.getAcroForm();
//setting the appearance
final PDFont font = PDType1Font.HELVETICA;
final PDResources resources = new PDResources();
resources.put(COSName.getPDFName("Helv"), font);
String defaultAppearanceString = "/Helv 0 Tf 0 g";
for(final PDField f : acroForm.getFields()) {
if(f instanceof PDTextField) {
defaultAppearanceString = "/Helv 12 Tf 0 0 1 rg";
final List<PDAnnotationWidget> widgets = ((PDTextField)f).getWidgets();
for(final PDField f : acroForm.getFields()) {
// save modified pdf to file"<filepath>\\<filename>");
//print to file (to pdf)
if (job.printDialog()) {
try {
// Desktop.getDesktop().print();
} catch (final PrinterException e) {
// TODO Auto-generated catch block
// copied from pdfbox examples
public static void createDummyPDF(final String path) throws IOException
// Create a new document with an empty page.
try (PDDocument document = new PDDocument())
final PDPage page = new PDPage(PDRectangle.A4);
// Adobe Acrobat uses Helvetica as a default font and
// stores that under the name '/Helv' in the resources dictionary
final PDFont font = PDType1Font.HELVETICA;
final PDResources resources = new PDResources();
resources.put(COSName.getPDFName("Helv"), font);
// Add a new AcroForm and add that to the document
final PDAcroForm acroForm = new PDAcroForm(document);
// Add and set the resources and default appearance at the form level
// Acrobat sets the font size on the form level to be
// auto sized as default. This is done by setting the font size to '0'
String defaultAppearanceString = "/Helv 0 Tf 0 g";
// Add a form field to the form.
final PDTextField textBox = new PDTextField(acroForm);
// Acrobat sets the font size to 12 as default
// This is done by setting the font size to '12' on the
// field level.
// The text color is set to blue in this example.
// To use black, replace "0 0 1 rg" with "0 0 0 rg" or "0 g".
defaultAppearanceString = "/Helv 12 Tf 0 0 1 rg";
// add the field to the acroform
// Specify the widget annotation associated with the field
final PDAnnotationWidget widget = textBox.getWidgets().get(0);
final PDRectangle rect = new PDRectangle(50, 750, 200, 50);
// set green border and yellow background
// if you prefer defaults, just delete this code block
final PDAppearanceCharacteristicsDictionary fieldAppearance
= new PDAppearanceCharacteristicsDictionary(new COSDictionary());
fieldAppearance.setBorderColour(new PDColor(new float[]{0,1,0}, PDDeviceRGB.INSTANCE));
fieldAppearance.setBackground(new PDColor(new float[]{1,1,0}, PDDeviceRGB.INSTANCE));
// make sure the widget annotation is visible on screen and paper
// Add the widget annotation to the page
// set the field value
textBox.setValue("Sample field");;
//copied from pdfbox examples
public static void processFields(final List<PDField> fields, final PDResources resources) { -> {
final COSDictionary cosObject = f.getCOSObject();
final String value = cosObject.getString(COSName.DV) == null ?
cosObject.getString(COSName.V) : cosObject.getString(COSName.DV);
System.out.println("Setting " + f.getFullyQualifiedName() + ": " + value);
try {
} catch (final IOException e) {
if (e.getMessage().matches("Could not find font: /.*")) {
final String fontName = e.getMessage().replaceAll("^[^/]*/", "");
System.out.println("Adding fallback font for: " + fontName);
resources.put(COSName.getPDFName(fontName), PDType1Font.HELVETICA);
try {
} catch (final IOException e1) {
} else {
if (f instanceof PDNonTerminalField) {
processFields(((PDNonTerminalField) f).getChildren(), resources);
I would expect the PDF saved to disk and the one produced by job.print() to look identical in the viewer, but they do not.
If I take the generated PDF with read-only disabled, I can use a PDF viewer like Foxit Reader to fill in the form and print it again. This produces the right output. With the job.print() version, the text contained in the (text) form field disappears.
Does anyone have a clue why this is the case?
I'm using PDFBox 2.0.13 (latest release) and LibreOffice
Here are the referenced files and here you can download the debugger (jar file, runnable with java -jar ).
I'm trying to generate a PDF report consisting of sentences in multiple languages. For that I'm using the Google Noto fonts, but the Noto CJK fonts don't support some of the Latin special characters. As a result, PDFBox either fails to generate the report or sometimes shows weird characters.
Does anyone have an appropriate solution? I tried multiple things, but was unable to find a single TTF file that supports all of Unicode. I also tried falling back to different font files, but that would be too much work.
Languages I support: Japanese, German, Spanish, Portuguese, English.
Note: I don't want to use arialuni.ttf file due to licensing issues.
Can anyone suggest anything?
Here is the code that will be in release 2.0.14 in the examples subproject:
/**
 * Output a text without knowing which font is the right one. One use case is a worldwide
 * address list. Only LTR languages are supported, RTL (e.g. Hebrew, Arabic) are not
 * supported so they would appear in the wrong direction.
 * Complex scripts (Thai, Arabic, some Indian languages) are also not supported, any output
 * will look weird. There is an (unfinished) effort here:
 *
 * @author Tilman Hausherr
 */
public class EmbeddedMultipleFonts
public static void main(String[] args) throws IOException
try (PDDocument document = new PDDocument())
PDPage page = new PDPage(PDRectangle.A4);
PDFont font1 = PDType1Font.HELVETICA; // always have a simple font as first one
TrueTypeCollection ttc2 = new TrueTypeCollection(new File("c:/windows/fonts/batang.ttc"));
PDType0Font font2 = PDType0Font.load(document, ttc2.getFontByName("Batang"), true); // Korean
TrueTypeCollection ttc3 = new TrueTypeCollection(new File("c:/windows/fonts/mingliu.ttc"));
PDType0Font font3 = PDType0Font.load(document, ttc3.getFontByName("MingLiU"), true); // Chinese
PDType0Font font4 = PDType0Font.load(document, new File("c:/windows/fonts/mangal.ttf")); // Indian
PDType0Font font5 = PDType0Font.load(document, new File("c:/windows/fonts/ArialUni.ttf")); // Fallback
try (PDPageContentStream cs = new PDPageContentStream(document, page))
List<PDFont> fonts = new ArrayList<>();
cs.newLineAtOffset(20, 700);
showTextMultiple(cs, "abc 한국 中国 भारत 日本 abc", fonts, 20);
static void showTextMultiple(PDPageContentStream cs, String text, List<PDFont> fonts, float size)
throws IOException
// first try all at once
cs.setFont(fonts.get(0), size);
catch (IllegalArgumentException ex)
// do nothing
// now try separately
int i = 0;
while (i < text.length())
boolean found = false;
for (PDFont font : fonts)
String s = text.substring(i, i + 1);
// it works! Try more with this font
int j = i + 1;
for (; j < text.length(); ++j)
String s2 = text.substring(j, j + 1);
if (isWinAnsiEncoding(s2.codePointAt(0)) && font != fonts.get(0))
// Without this segment, the example would have a flaw:
// This code tries to keep the current font, so
// the second "abc" would appear in a different font
// than the first one, which would be weird.
// This segment assumes that the first font has WinAnsiEncoding.
// (all static PDType1Font Times / Helvetica / Courier fonts)
catch (IllegalArgumentException ex)
// it's over
s = text.substring(i, j);
cs.setFont(font, size);
i = j;
found = true;
catch (IllegalArgumentException ex)
// didn't work, will try next font
if (!found)
throw new IllegalArgumentException("Could not show '" + text.substring(i, i + 1) +
"' with the fonts provided");
static boolean isWinAnsiEncoding(int unicode)
String name = GlyphList.getAdobeGlyphList().codePointToName(unicode);
if (".notdef".equals(name))
return false;
return WinAnsiEncoding.INSTANCE.contains(name);
Alternatives to arialuni can be found here:
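The core idea of the code above, splitting the text into runs that each use one font, can be tried out without PDFBox. The sketch below is a simplified variant, not the example's exact logic: each font is stood in for by a hypothetical IntPredicate that says whether it covers a code point (in real code this would be the try/catch around the font's encoding), and every character simply goes to the earliest font in the list that covers it. It also walks code points rather than single chars, so characters outside the BMP stay intact.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

class FontRunSplitter {
    /** One stretch of text plus the index of the font chosen for it. */
    static final class Run {
        final int fontIndex;
        final String text;

        Run(int fontIndex, String text) {
            this.fontIndex = fontIndex;
            this.text = text;
        }
    }

    /** Index of the first font whose predicate accepts the code point, or -1. */
    static int firstSupporting(int codePoint, List<IntPredicate> fonts) {
        for (int f = 0; f < fonts.size(); f++) {
            if (fonts.get(f).test(codePoint)) {
                return f;
            }
        }
        return -1;
    }

    /** Groups consecutive code points that resolve to the same font. */
    static List<Run> split(String text, List<IntPredicate> fonts) {
        List<Run> runs = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            int chosen = firstSupporting(cp, fonts);
            if (chosen < 0) {
                throw new IllegalArgumentException(
                        "No font for U+" + Integer.toHexString(cp).toUpperCase());
            }
            // Extend the run while the same font stays the first match.
            int j = i + Character.charCount(cp);
            while (j < text.length()) {
                int next = text.codePointAt(j);
                if (firstSupporting(next, fonts) != chosen) {
                    break;
                }
                j += Character.charCount(next);
            }
            runs.add(new Run(chosen, text.substring(i, j)));
            i = j;
        }
        return runs;
    }

    public static void main(String[] args) {
        // Hypothetical coverage tests standing in for real fonts.
        List<IntPredicate> fonts = new ArrayList<>();
        fonts.add(c -> c < 0x80);                     // stand-in for a Latin font
        fonts.add(c -> c >= 0x4E00 && c <= 0x9FFF);   // stand-in for a CJK font
        for (Run r : split("abc \u4E2D\u56FD abc", fonts)) {
            System.out.println(r.fontIndex + ": [" + r.text + "]");
        }
    }
}
```

Each Run would then map to one setFont plus showText pair. Putting the plain Latin font first in the list keeps both "abc" parts in the same font, which is the effect the isWinAnsiEncoding check in the original example is after.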
I compare two PDF files and mark the differences with highlights.
When I use PDFBox to merge them for a side-by-side comparison, the highlights are missing in the output.
I am using this function, which merges two PDF files by placing all of their pages side by side:
void generateSideBySidePDF() {
File pdf1File = new File(FILE1_PATH);
File pdf2File = new File(FILE2_PATH);
File outPdfFile = new File(OUTFILE_PATH);
PDDocument pdf1 = null;
PDDocument pdf2 = null;
PDDocument outPdf = null;
try {
pdf1 = PDDocument.load(pdf1File);
pdf2 = PDDocument.load(pdf2File);
outPdf = new PDDocument();
for(int pageNum = 0; pageNum < pdf1.getNumberOfPages(); pageNum++) {
// Create output PDF frame
PDRectangle pdf1Frame = pdf1.getPage(pageNum).getCropBox();
PDRectangle pdf2Frame = pdf2.getPage(pageNum).getCropBox();
PDRectangle outPdfFrame = new PDRectangle(pdf1Frame.getWidth()+pdf2Frame.getWidth(), Math.max(pdf1Frame.getHeight(), pdf2Frame.getHeight()));
// Create output page with calculated frame and add it to the document
COSDictionary dict = new COSDictionary();
dict.setItem(COSName.TYPE, COSName.PAGE);
dict.setItem(COSName.MEDIA_BOX, outPdfFrame);
dict.setItem(COSName.CROP_BOX, outPdfFrame);
dict.setItem(COSName.ART_BOX, outPdfFrame);
PDPage outPdfPage = new PDPage(dict);
// Source PDF pages has to be imported as form XObjects to be able to insert them at a specific point in the output page
LayerUtility layerUtility = new LayerUtility(outPdf);
PDFormXObject formPdf1 = layerUtility.importPageAsForm(pdf1, pageNum);
PDFormXObject formPdf2 = layerUtility.importPageAsForm(pdf2, pageNum);
// Add form objects to output page
AffineTransform afLeft = new AffineTransform();
layerUtility.appendFormAsLayer(outPdfPage, formPdf1, afLeft, "left" + pageNum);
AffineTransform afRight = AffineTransform.getTranslateInstance(pdf1Frame.getWidth(), 0.0);
layerUtility.appendFormAsLayer(outPdfPage, formPdf2, afRight, "right" + pageNum);
} catch (IOException e) {
} finally {
try {
if (pdf1 != null) pdf1.close();
if (pdf2 != null) pdf2.close();
if (outPdf != null) outPdf.close();
} catch (IOException e) {
Insert this into your code after the "Source PDF pages has to be imported" segment to copy the annotations. The annotations from the right-hand PDF must have their rectangles moved.
// copy annotations
PDPage src1Page = pdf1.getPage(pageNum);
PDPage src2Page = pdf2.getPage(pageNum);
for (PDAnnotation ann : src1Page.getAnnotations())
for (PDAnnotation ann : src2Page.getAnnotations())
PDRectangle rect = ann.getRectangle();
ann.setRectangle(new PDRectangle(rect.getLowerLeftX() + pdf1Frame.getWidth(), rect.getLowerLeftY(), rect.getWidth(), rect.getHeight()));
Note that this code has a flaw: it works only with annotations that have an appearance stream (most do). It will have weird effects for those that don't; in that case, one would have to adjust the coordinates depending on the annotation type. For highlights, that would be the quadpoints, for lines the line coordinates, and so on.
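For highlight annotations without an appearance stream, the adjustment mentioned above comes down to shifting the x values of the quadpoints. QuadPoints are a flat array of x/y pairs, eight numbers per quadrilateral. A minimal sketch on a plain float[]; the class name QuadPointShift and the helper shiftX are hypothetical, and in PDFBox 2.x the array would, as far as I know, be read and written through PDAnnotationTextMarkup.getQuadPoints() and setQuadPoints(float[]):

```java
import java.util.Arrays;

class QuadPointShift {
    /**
     * Returns a copy of the quadpoints with every x coordinate moved by dx.
     * The array layout is x1 y1 x2 y2 ..., so x values sit at even indices.
     */
    static float[] shiftX(float[] quadPoints, float dx) {
        float[] shifted = quadPoints.clone();
        for (int i = 0; i < shifted.length; i += 2) {
            shifted[i] += dx;
        }
        return shifted;
    }

    public static void main(String[] args) {
        // One quadrilateral; the offset would be the width of the left page.
        float[] quad = {10f, 20f, 30f, 20f, 10f, 5f, 30f, 5f};
        System.out.println(Arrays.toString(shiftX(quad, 100f)));
    }
}
```

The offset to pass is the same pdf1Frame.getWidth() already used for the annotation rectangle.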
Update 1
I'm trying to write some Arabic characters in a pdf document using pdfbox. As a result I get some strange characters. You can find below the code snippet I used for my test. Notice that the same code was used to print Latin characters without any problem.
public static void main(String[] args) throws Exception {
PDDocument document = new PDDocument();
PDPage page = new PDPage(PDPage.PAGE_SIZE_A4);
PDPageContentStream stream = new PDPageContentStream(document, page,true, true);
//Use of a unicode font
PDFont font = PDTrueTypeFont.loadTTF(document,"C:/arialuni.ttf");
font.setFontEncoding(new WinAnsiEncoding());
stream.setFont(font, 12);
stream.moveTextPositionByAmount(40, 600);
stream.drawString("سي ججس ححسيب حسججسيبنم حح ");
Thanks for your help. I tried a Unicode font downloaded from the Microsoft website, but I still get the same result.
Update 2
By using the methods 'drawUnicodeString' and 'loadTTF' that I got from PDFBOX-922,
I was able to write Arabic characters, but they are disconnected and ordered from left to right. Here are the two methods 'drawUnicodeString' and 'loadTTF':
public void drawUnicodeString(String text) throws IOException {
COSString string = new COSString();
for (int i = 0; i < text.length(); i++) {
char c = text.charAt(i);
string.append(c >> 8);
string.append(c & 0xff);
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
public static PDType0Font loadTTF(PDDocument doc, InputStream is)
throws IOException {
/* Load the font which we will convert to Type0 font. */
PDTrueTypeFont pdTtf = PDTrueTypeFont.loadTTF(doc, is);
TrueTypeFont ttf = pdTtf.getTTFFont();
CMAPEncodingEntry unicodeMap = null;
for (CMAPEncodingEntry candidate : ttf.getCMAP().getCmaps()) {
if (candidate.getPlatformId() == CMAPTable.PLATFORM_WINDOWS
&& candidate.getPlatformEncodingId() == CMAPTable.ENCODING_UNICODE) {
unicodeMap = candidate;
if (unicodeMap == null) {
throw new RuntimeException(
"To use as CIDFont, the TTF must have a Windows platform Unicode encoding");
float scaling = 1000f / ttf.getHeader().getUnitsPerEm();
MyPDCIDFontType2Font pdCidFont2 = new MyPDCIDFontType2Font();
pdCidFont2.setFontDescriptor((PDFontDescriptorDictionary) pdTtf
/* Fixme -- should determine the minimum and maximum charcode in the map */
int[] cid2gid = new int[65536];
List<Float> widths = new ArrayList<Float>();
int[] widthValues = ttf.getHorizontalMetrics().getAdvanceWidth();
for (int i = 0; i < cid2gid.length; i++) {
int glyph = unicodeMap.getGlyphId(i);
cid2gid[i] = glyph;
widths.add((float) i);
widths.add((float) i);
widths.add(widthValues[glyph] * scaling);
/* Now construct the type0 font that we actually return */
myType0Font pdFont0 = new myType0Font();
pdFont0.setDescendantFonts(new COSObject(pdCidFont2.getCOSObject()));
// pdfont0.setToUnicode(COSName.IDENTITY_H); XXX how to express identity
// mapping as ToUnicode program? */
return pdFont0;
And here are the characters printed:
I don't know why these characters are disconnected.
Arabic can be written by applying both PDFBOX-922 and PDFBOX-1287 (the diff files are attached to the issue descriptions).
I hope that the patches will be applied in version 2.0.
I suggest you try adding the ICU4J jars to your project.
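Disconnected letters in left-to-right order usually mean two steps are missing: reordering the text for right-to-left display and substituting the contextual letter forms (shaping). ICU4J ships classes for both (Bidi and ArabicShaping); the JDK alone only covers direction analysis. As a small sketch of that first step, using only java.text.Bidi from the standard library (the class name DirectionCheck and its classify method are made up for illustration), this tells you whether a string needs right-to-left handling at all:

```java
import java.text.Bidi;

class DirectionCheck {
    /** Classifies a string as "LTR", "RTL" or "MIXED" using the JDK bidi analysis. */
    static String classify(String text) {
        // Base direction is taken from the first strong character, defaulting to LTR.
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        if (bidi.isRightToLeft()) {
            return "RTL";
        }
        if (bidi.isLeftToRight()) {
            return "LTR";
        }
        return "MIXED";
    }

    public static void main(String[] args) {
        System.out.println(classify("abc"));
        System.out.println(classify("\u0633\u064A \u062C\u062C\u0633"));
        System.out.println(classify("abc \u0633\u064A"));
    }
}
```

This only detects direction. The strings classified as RTL or MIXED are the ones that would still need reordering and shaping before being drawn.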