How to substitute missing font when filling a form with PDFBox? - java

I'm trying to fill out a bunch of PDF Forms using PDFBox 2.0.8. For some documents I get the following error when setting the PDTextField's value:
java.io.IOException: Could not find font: /ArialMT
Apparently the font is not correctly embedded as is often the case with proprietary Microsoft fonts.
How can I tell PDFBox to substitute the font, e.g. with "normal" Arial or some other font? Setting the field's DA string to "/Helv 0 tf 0 g" resulted in a NullPointerException.

Based on the comments from Tilman Hausherr, I built a first fix that works independently of the operating system (Linux in my case).
acroForm.defaultResources.put(COSName.getPDFName("ArialMT"),
    PDType0Font.load(pdDocument, this.javaClass.classLoader.getResourceAsStream("fonts/ARIALMT.ttf"), false))
This only works for this particular font, though. What is still missing, and was actually the main point of my question, is a way to tell PDFBox to fall back to a certain font or DA string if the required font cannot be provided.
After Tilman came to the rescue again, I can now present the complete solution. Again, this is Kotlin, not Java:
PDDocument.load(file).use { pdDocument ->
    val acroForm = pdDocument.documentCatalog.acroForm
    acroForm.defaultResources.put(COSName.getPDFName("ArialMT"),
        PDType0Font.load(pdDocument, this.javaClass.classLoader.getResourceAsStream("fonts/ARIALMT.ttf"), false))
    val pdField: PDField? = acroForm.getField(fieldname)
    val value = ...
    when (pdField) {
        is PDCheckBox -> {
            if (value is Boolean) {
                when (value) {
                    true -> pdField.check()
                    false -> pdField.unCheck()
                }
            } else {
                log.error("RENDER_FORM: Need Boolean for ${pdField.fullyQualifiedName} but got $value")
            }
        }
        is PDTextField -> {
            try {
                pdField.value = value?.toString() ?: ""
            } catch (ioException: IOException) {
                pdField.cosObject.setString(COSName.DA, "/Helv 0 Tf 0 g")
                pdField.value = value?.toString() ?: ""
                log.error("RENDER_FORM: Writing text field failed: ${ioException.message}")
            }
        }
        null -> {
            log.error("RENDER_FORMULAR: Formfield $fieldname does not exist in $name")
        }
        else -> log.error("RENDER_FORMULAR: Formfield $pdField ($fieldname) is of unhandled type ${pdField.fieldType}")
    }
    val stream = ByteArrayOutputStream()
    pdDocument.save(stream)
    pdDocument.close()
    return stream.toByteArray()
}

Add "ArialMT" to the default resources:
try (PDDocument doc = PDDocument.load(new File("F2_Datenblatt_022015.pdf")))
{
    PDAcroForm acroForm = doc.getDocumentCatalog().getAcroForm();
    PDField field = acroForm.getField("Vorname_Name");
    // fails with IOException as described in question
    //field.setValue("Tilman Hausherr");
    // Method 1, just add type1 Helvetica (allows only WinAnsiEncoding glyphs)
    //acroForm.getDefaultResources().put(COSName.getPDFName("ArialMT"), PDType1Font.HELVETICA);
    // Method 2, add the full Arial font (allows for more different glyphs)
    // important: use the method that switches off subsetting
    acroForm.getDefaultResources().put(
        COSName.getPDFName("ArialMT"),
        PDType0Font.load(doc, new FileInputStream("c:/windows/fonts/arial.ttf"), false));
    field.setValue("Tilman Hausherr");
    doc.save("F2_Datenblatt_022015-mod.pdf");
}
Update:
Turns out the code in the question would have worked too with the file - almost. It's "Tf" and not "tf", so the string would have been "/Helv 0 Tf 0 g". We'll research how to avoid an NPE and get a meaningful exception.
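Since PDF operators are case-sensitive, a typo like "tf" is easy to miss. Here is a minimal sketch of a pre-check you could run on a DA string before setting it. It uses only the standard library; the class and method names are hypothetical and not part of PDFBox, and it only looks for the pattern "/Name number Tf", nothing more.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of PDFBox: sanity-checks a /DA string. */
final class DefaultAppearanceCheck {

    /**
     * Returns true if the string contains "/FontName size Tf", i.e. a name
     * token, a numeric token and the case-sensitive operator Tf in that order.
     */
    static boolean hasFontOperator(String da) {
        if (da == null) {
            return false;
        }
        List<String> tokens = Arrays.asList(da.trim().split("\\s+"));
        for (int i = 2; i < tokens.size(); i++) {
            if ("Tf".equals(tokens.get(i))
                    && tokens.get(i - 2).startsWith("/")
                    && isNumber(tokens.get(i - 1))) {
                return true;
            }
        }
        return false;
    }

    private static boolean isNumber(String token) {
        try {
            Float.parseFloat(token);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(hasFontOperator("/Helv 0 Tf 0 g")); // correct operator
        System.out.println(hasFontOperator("/Helv 0 tf 0 g")); // lowercase typo from the question
    }
}
```

With such a check the lowercase variant can be rejected with a clear message before it ever reaches the appearance generation.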

Related

PDFBox in Java trying to Edit Pdf with existing acroForms but values are hidden untill i press on them

I am using PDFBox to process a document that was generated in a NestJS service with pdf-lib (the fields are created with form.createTextField(field.id)). The document is then sent to Java, where I put a signature box on top of it and fill in the form fields. The fields are filled and everything works in PDF.js: I can see the fields and their values.
But when I open the PDF in Google Chrome I don't see the values at all, and in Adobe Reader I don't see them until I click on a field.
Here is my Java code:
public void prepareForSigning(DigestAlgorithm digestAlgorithm,
SignatureType signatureType,
UserData userData, List<FieldInput> formFields) throws IOException, NoSuchAlgorithmException {
this.digestAlgorithm = digestAlgorithm;
id = Utils.generateDocumentId();
pdDocument = PDDocument.load(contentIn);
int accessPermissions = getDocumentPermissions();
if (accessPermissions == 1) {
throw new AisClientException("Cannot sign document [" + name + "]. Document contains a certification " +
"that does not allow any changes.");
}
// add fields
// get the document catalog
try {
PDAcroForm acroForm = pdDocument.getDocumentCatalog().getAcroForm();
acroForm.setSignaturesExist(true);
acroForm.setAppendOnly(true);
acroForm.getCOSObject().setDirect(true);
acroForm.getCOSObject().setNeedToBeUpdated(true);
// acroForm.setNeedAppearances(true);
COSObject pdfFields = acroForm.getCOSObject().getCOSObject(COSName.FIELDS);
if (pdfFields != null) {
pdfFields.setNeedToBeUpdated(true);
}
for (int i = 0; i < formFields.size(); i++) {
PDField field = acroForm.getField(formFields.get(i).id);
if (field != null) {
// will also set a checkbox if the value is Yes
// checking for formFields.get(i).value == "true" returns
if ("Btn".equals(field.getFieldType()) && formFields.get(i).value.equals("true")) {
field.setValue("Yes");
} else {
field.setValue(formFields.get(i).value);
}
field.setReadOnly(true);
field.getCOSObject().setNeedToBeUpdated(true);
field.getWidgets().get(0).getAppearance().getCOSObject().setNeedToBeUpdated(true);
Log.info("set field: " + field.getFullyQualifiedName() + " to " + formFields.get(i).value);
}
}
pdDocument.getDocumentCatalog().getCOSObject().setNeedToBeUpdated(true);
} catch (Exception e) {
Log.warn(e);
}
PDSignature pdSignature = new PDSignature();
Calendar signDate = Calendar.getInstance();
if (signatureType == SignatureType.TIMESTAMP) {
// Now, according to ETSI TS 102 778-4, annex A.2, the type of a Dictionary that
// holds document timestamp should be DocTimeStamp
// However, adding this (as of Feb/17/2021), it trips the ETSI Conformance
// Checked online tool, making it say
// "There is no signature dictionary in the document". So, for now (Feb/17/2021)
// this has been removed. This makes the
// ETSI Conformance Checker happy.
// pdSignature.setType(COSName.DOC_TIME_STAMP);
pdSignature.setFilter(PDSignature.FILTER_ADOBE_PPKLITE);
pdSignature.setSubFilter(COSName.getPDFName("ETSI.RFC3161"));
} else {
pdSignature.setFilter(PDSignature.FILTER_ADOBE_PPKLITE);
pdSignature.setSubFilter(PDSignature.SUBFILTER_ETSI_CADES_DETACHED);
// Add 3 Minutes to move signing time within the OnDemand Certificate Validity
// This is only relevant in case the signature does not include a timestamp
// See section 5.8.5.1 of the Reference Guide
signDate.add(Calendar.MINUTE, 3);
}
pdSignature.setSignDate(signDate);
pdSignature.setName(userData.getSignatureName());
pdSignature.setReason(userData.getSignatureReason());
pdSignature.setLocation(userData.getSignatureLocation());
pdSignature.setContactInfo(userData.getSignatureContactInfo());
SignatureOptions options = new SignatureOptions();
options.setPreferredSignatureSize(signatureType.getEstimatedSignatureSizeInBytes());
// create a visible signature at the specified coordinates
if (signatureDefinition != null) {
Rectangle2D humanRect = new Rectangle2D.Float(signatureDefinition.getX(),
signatureDefinition.getY(),
signatureDefinition.getWidth(),
signatureDefinition.getHeight());
PDRectangle rect = createSignatureRectangle(pdDocument, humanRect);
options.setVisualSignature(
createVisualSignatureTemplate(pdDocument, signatureDefinition.getPage(),
signatureDefinition.getImage(), rect, pdSignature));
options.setPage(signatureDefinition.getPage());
}
pdDocument.addSignature(pdSignature, options);
// Set this signature's access permissions level to 0, to ensure we just sign
// the PDF, not certify it
// for more details:
// https://wwwimages2.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf
// see section 12.7.4.5
setPermissionsForSignatureOnly();
pbSigningSupport = pdDocument.saveIncrementalForExternalSigning(inMemoryStream);
MessageDigest digest = MessageDigest.getInstance(digestAlgorithm.getDigestAlgorithm());
byte[] contentToSign = IOUtils.toByteArray(pbSigningSupport.getContent());
byte[] hashToSign = digest.digest(contentToSign);
options.close();
base64HashToSign = Base64.getEncoder().encodeToString(hashToSign);
}
In the screenshot (Adobe Reader), the field with value 5 is visible only because I already clicked on it, so it has focus.
When I set acroForm.setNeedAppearances(true) I can see the values, but then the signature field is not there. Am I missing something in the code?
I expect the values to appear in Google Chrome and Adobe Reader without having to click on the fields.
Picture of the PDF fields without values, with one field focused
PDF sample file

PDFBox invalid option in radio

When trying to fill the form of this pdf (http://vaielab.com/Test/2.pdf) with this code
PDDocument pdfDocument = PDDocument.load(new File("2.pdf"));
pdfDocument.setAllSecurityToBeRemoved(true);
PDDocumentCatalog docCatalog = pdfDocument.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
if (acroForm != null) {
    PDField field = (PDField) acroForm.getField("rad2");
    try {
        field.setValue("0");
    } catch (Exception e) {
        System.out.println(e);
    }
}
pdfDocument.save("output.pdf");
pdfDocument.close();
I get this error: value '0' is not a valid option for the field rad2, valid values are: [Yes] and Off
But value "0" should be a valid option, and if I do a dump_data_fields with pdftk, I get this:
FieldType: Button
FieldName: rad2
FieldFlags: 49152
FieldJustification: Left
FieldStateOption: 0
FieldStateOption: 1
FieldStateOption: Off
FieldStateOption: Yes
I also tried the value "1" but get the exact same error.
I'm using pdfbox 2.0.20
This is because of the /Opt values in Root/AcroForm/Fields/[7]/Opt: that array contains only two "Yes" entries. The PDButton.setValue() code in PDFBox handles the field differently when /Opt is set. The best option here would be not to set /Opt, or to remove these entries by calling field.setExportValues(null). Then the valid values would be "0", "1" and "Off".

How to read PDF sections using Header font size using PDFBox?

I am trying to read PDF documents and split them into sections based on the header font size, or on font and font size. I currently have it implemented based on the answer in this post. But because my PDF uses the same font for the header and the sub-header, I need to modify the code so that it matches on font size, or on both.
List<TextSectionDefinition> sectionDefinitions = Arrays.asList(
new TextSectionDefinition("Section", x -> x.get(0).get(0).getFont().getName().contains("Calibri,Bold"), TextSectionDefinition.MultiLine.multiLineHeader, true)
);
document.getClass();
PDFTextSectionStripper stripper = new PDFTextSectionStripper(sectionDefinitions);
stripper.getText(document);
System.out.println("Sections:");
List<String> texts = new ArrayList<>();
for (TextSection textSection : stripper.getSections()) {
String text = textSection.toString();
System.out.println(text);
texts.add(text);
}
return ResponseEntity.ok(texts);
My problem is that when I try to use getFontSize instead of getFont, it doesn't accept any parameters, in my case 16 (the font size).
In the answer you refer to there are text section definitions like this:
new TextSectionDefinition("Titel",
x->x.get(0).get(0).getFont().getName().contains("CMBX12"),
MultiLine.singleLine,
false)
I assume your remark
if I try to use getFontSize instead of getFont it doesn't allow any parameters to be entered, in my case 16
indicates that you want to exchange the lambda expression in the second parameter
x->x.get(0).get(0).getFont().getName().contains("CMBX12")
by something that tests the font size. Thus, have you tried replacing it by
x->x.get(0).get(0).getFontSize() == 16
or
x->x.get(0).get(0).getFontSizeInPt() == 16
or
x-> {
float size = x.get(0).get(0).getFontSizeInPt();
return size > 15 && size < 17;
}
yet?
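If you need "both", the two tests can also be combined with Predicate.and. The following is only a sketch under assumptions: TextRun is a hypothetical stand-in for the text position objects the stripper hands to the lambda (it just carries a font name and a size in points), and the tolerance of 0.5 pt is an arbitrary choice.

```java
import java.util.function.Predicate;

/** Hypothetical stand-in for a text position: only a font name and a size in points. */
final class TextRun {
    final String fontName;
    final float fontSizeInPt;

    TextRun(String fontName, float fontSizeInPt) {
        this.fontName = fontName;
        this.fontSizeInPt = fontSizeInPt;
    }
}

final class HeaderPredicates {

    /** Matches sizes within a tolerance, because float sizes are rarely exactly equal. */
    static Predicate<TextRun> sizeAbout(float target, float tolerance) {
        return run -> Math.abs(run.fontSizeInPt - target) <= tolerance;
    }

    /** Matches font names containing the given fragment (subset prefixes are ignored this way). */
    static Predicate<TextRun> fontNameContains(String fragment) {
        return run -> run.fontName != null && run.fontName.contains(fragment);
    }

    public static void main(String[] args) {
        Predicate<TextRun> header =
                fontNameContains("Calibri,Bold").and(sizeAbout(16f, 0.5f));
        System.out.println(header.test(new TextRun("ABCDEF+Calibri,Bold", 16.02f)));
        System.out.println(header.test(new TextRun("ABCDEF+Calibri,Bold", 12f)));
    }
}
```

In the real section definition the same idea would be written against x.get(0).get(0), with the name test and the size test joined by &&.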

Why pdf contain one field only is around 500Kb

Here you can download a PDF with one AcroForm field; its size is exactly 427 KB.
If I remove this single field, the file is only 3 KB. Why does this happen?
I tried to analyse it with PDF Debugger and nothing looks odd to me.
There's an embedded "Arial" font in the acroform default resources, see Root/AcroForm/DR/Font/Arial/FontDescriptor/FontFile2.
Either you or whoever created the PDF added it for no reason; the font is not used or referenced. To verify this for the AcroForm default resources, you can check whether the /DA entry (default appearance) of each field contains the font name.
When you removed the field, you somehow also removed the font from the AcroForm default resources. (You didn't write how you removed it.)
Here's some code to do it (null checks mostly missing):
PDAcroForm acroForm = doc.getDocumentCatalog().getAcroForm();
PDResources defaultResources = acroForm.getDefaultResources();
COSDictionary fontDict = (COSDictionary) defaultResources.getCOSObject().getDictionaryObject(COSName.FONT);
List<String> defaultAppearances = new ArrayList<>();
List<COSName> fontDeletionList = new ArrayList<>();
for (PDField field : acroForm.getFieldTree())
{
    if (field instanceof PDVariableText)
    {
        PDVariableText vtField = (PDVariableText) field;
        defaultAppearances.add(vtField.getDefaultAppearance());
    }
}
for (COSName fontName : defaultResources.getFontNames())
{
    if (COSName.HELV.equals(fontName) || COSName.ZA_DB.equals(fontName))
    {
        // Adobe default, always keep
        continue;
    }
    boolean found = false;
    for (String da : defaultAppearances)
    {
        if (da != null && da.contains("/" + fontName.getName()))
        {
            found = true;
            break;
        }
    }
    System.out.println(fontName + ": " + found);
    if (!found)
    {
        fontDeletionList.add(fontName);
    }
}
System.out.println("deletion list: " + fontDeletionList);
for (COSName fontName : fontDeletionList)
{
    fontDict.removeItem(fontName);
}
The resulting file has 5KB size now.
I haven't checked the annotations. Some of them have also a /DA string but it is unclear if the acroform default resources fonts are to be used when reconstructing a missing appearance stream.
Update:
Here's some additional code to replace Arial with Helv:
for (PDField field : acroForm.getFieldTree())
{
    if (field instanceof PDVariableText)
    {
        PDVariableText vtField = (PDVariableText) field;
        String defaultAppearance = vtField.getDefaultAppearance();
        if (defaultAppearance.startsWith("/Arial"))
        {
            vtField.setDefaultAppearance("/Helv " + defaultAppearance.substring(7));
            vtField.getWidgets().get(0).setAppearance(null); // this removes the font usage
            vtField.setValue(vtField.getValueAsString());
        }
        defaultAppearances.add(vtField.getDefaultAppearance());
    }
}
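One caveat: the startsWith("/Arial") plus substring(7) combination is tailored to this file. It would also fire for a resource name such as /ArialMT and then cut the string in the wrong place. A slightly more defensive string rewrite could look like the sketch below. It is plain Java with a hypothetical class name, not PDFBox API, and it only swaps a name that is directly followed by "size Tf".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of PDFBox: swaps the font name in a /DA string. */
final class DaFontReplacer {

    /**
     * Replaces "/oldName" with "/newName" only where it is followed by a
     * number and the Tf operator, so "/ArialMT" is left alone when
     * oldName is "Arial".
     */
    static String replaceFont(String da, String oldName, String newName) {
        Pattern fontOperand = Pattern.compile(
                "/" + Pattern.quote(oldName)
                + "(?=\\s+[-+]?[0-9]*\\.?[0-9]+\\s+Tf(\\s|$))");
        return fontOperand.matcher(da)
                .replaceAll(Matcher.quoteReplacement("/" + newName));
    }

    public static void main(String[] args) {
        System.out.println(replaceFont("/Arial 10 Tf 0 g", "Arial", "Helv"));
        System.out.println(replaceFont("/ArialMT 10 Tf 0 g", "Arial", "Helv"));
    }
}
```

The result would then be passed to vtField.setDefaultAppearance(...) in place of the substring expression.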
Note that this may not be a good idea, because the standard 14 fonts have only limited characters. Try
vtField.setValue("Ayşe");
and you'll get an exception.
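If you want to know in advance whether a value is likely to fail, a rough pre-check is possible with the standard library alone. The sketch below assumes that the JDK charset windows-1252 is a good enough approximation of WinAnsiEncoding; the two are close but this is not a guarantee, and the class name is hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

/** Hypothetical pre-check, not part of PDFBox. */
final class WinAnsiPreCheck {

    /**
     * Returns true if every character of the text can be encoded in
     * windows-1252, used here as an approximation of WinAnsiEncoding.
     */
    static boolean fitsWindows1252(String text) {
        CharsetEncoder encoder = Charset.forName("windows-1252").newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(fitsWindows1252("Tilman Hausherr")); // true
        System.out.println(fitsWindows1252("Ay\u015Fe"));       // false, U+015F is not in windows-1252
    }
}
```

A value that fails this check is a candidate for keeping the embedded font instead of switching the field to Helv.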
More general code to replace font can be found in this answer.

Exception in thread "main" org.pdfclown.util.parsers.ParseException: 'name' table does NOT exist

I am trying to run the Java code written by Stefano Chizzolini (awesome guy, creator of PDF Clown) to parse a PDF using the PDF Clown library. I am getting this error and I don't know what I can do to fix it.
Exception in thread "main" org.pdfclown.util.parsers.ParseException: 'name' table does NOT exist.
at org.pdfclown.documents.contents.fonts.OpenFontParser.getName(OpenFontParser.java:570)
at org.pdfclown.documents.contents.fonts.OpenFontParser.load(OpenFontParser.java:221)
at org.pdfclown.documents.contents.fonts.OpenFontParser.<init>(OpenFontParser.java:205)
at org.pdfclown.documents.contents.fonts.TrueTypeFont.loadEncoding(TrueTypeFont.java:91)
at org.pdfclown.documents.contents.fonts.SimpleFont.onLoad(SimpleFont.java:118)
at org.pdfclown.documents.contents.fonts.Font.load(Font.java:738)
at org.pdfclown.documents.contents.fonts.Font.<init>(Font.java:351)
at org.pdfclown.documents.contents.fonts.SimpleFont.<init>(SimpleFont.java:62)
at org.pdfclown.documents.contents.fonts.TrueTypeFont.<init>(TrueTypeFont.java:68)
at org.pdfclown.documents.contents.fonts.Font.wrap(Font.java:253)
at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:72)
at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:1)
at org.pdfclown.documents.contents.ResourceItems.get(ResourceItems.java:119)
at org.pdfclown.documents.contents.objects.SetFont.getResource(SetFont.java:119)
at org.pdfclown.documents.contents.objects.SetFont.getFont(SetFont.java:83)
at org.pdfclown.documents.contents.objects.SetFont.scan(SetFont.java:97)
at org.pdfclown.documents.contents.ContentScanner.moveNext(ContentScanner.java:1330)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:626)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:296)
at PDFReader.FullExtract.run(FullExtract.java:71)
at PDFReader.FullExtract.main(FullExtract.java:142)
I know the class OpenFontParser in the library package is throwing this error. Is there anything I can do to fix this?
This code works for most PDF. I have a PDF that it does not parse. I am guessing it is because of this symbol below in the pdf.
public class PDFReader extends Sample {
@Override
public void run()
{
String filePath = new String("C:\\Users\\XYZ\\Desktop\\SomeSamplePDF.pdf");
// 1. Open the PDF file!
File file;
try
{file = new File(filePath);}
catch(Exception e)
{throw new RuntimeException(filePath + " file access error.",e);}
// 2. Get the PDF document!
Document document = file.getDocument();
// 3. Extracting text from the document pages...
for(Page page : document.getPages())
{
extract(new ContentScanner(page)); // Wraps the page contents into a scanner.
}
close(file);
}
private void close(File file) {
// TODO Auto-generated method stub
}
/**
Scans a content level looking for text.
*/
/*
NOTE: Page contents are represented by a sequence of content objects,
possibly nested into multiple levels.
*/
private void extract(
ContentScanner level
)
{
if(level == null)
return;
while(level.moveNext())
{
ContentObject content = level.getCurrent();
if(content instanceof ShowText)
{
Font font = level.getState().getFont();
// Extract the current text chunk, decoding it!
System.out.println(font.decode(((ShowText)content).getText()));
}
else if(content instanceof Text
|| content instanceof ContainerObject)
{
// Scan the inner level!
extract(level.getChildLevel());
}
}
}
private boolean prompt(Page page)
{
int pageIndex = page.getIndex();
if(pageIndex > 0)
{
Map<String,String> options = new HashMap<String,String>();
options.put("", "Scan next page");
options.put("Q", "End scanning");
if(!promptChoice(options).equals(""))
return false;
}
System.out.println("\nScanning page " + (pageIndex+1) + "...\n");
return true;
}
public static void main(String args[])
{
new PDFReader().run();
}
}
The issue
As the stacktrace indicates, the problem is that some TrueType font embedded in the PDF does not contain a name table even though it is a required table:
org.pdfclown.util.parsers.ParseException: 'name' table does NOT exist.
...
at org.pdfclown.documents.contents.fonts.TrueTypeFont.loadEncoding(TrueTypeFont.java:91)
Thus, strictly speaking, that embedded font is invalid and consequently so is the PDF embedding it. PDF Clown runs into an exception due to this validity issue.
Some background
A TrueType font file consists of a sequence of concatenated tables. ...
The first of the tables is the font directory, a special table that facilitates access to the other tables in the font. The directory is followed by a sequence of tables containing the font data. These tables can appear in any order. Certain tables are required for all fonts. Others are optional depending upon the functionality expected of a particular font.
Tables that are required must appear in any valid TrueType font file. The required tables and their tag names are shown in Table 2.
Table 2: The required tables
Tag      Table
'cmap'   character to glyph mapping
'glyf'   glyph data
'head'   font header
'hhea'   horizontal header
'hmtx'   horizontal metrics
'loca'   index to location
'maxp'   maximum profile
'name'   naming
'post'   PostScript
(Section TrueType Font files: an overview in chapter 6 The TrueType Font File in the TrueType Reference Manual)
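To see which of these tables a given font program actually carries, you can read its font directory yourself. The following is a minimal sketch, assuming you have already extracted the raw TrueType bytes (e.g. the decoded FontFile2 stream) into a byte array; it does not handle TrueType collections, and the class name and the fakeDirectory demo helper are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: lists required TrueType tables missing from a font directory. */
final class TrueTypeTableCheck {

    static final List<String> REQUIRED = Arrays.asList(
            "cmap", "glyf", "head", "hhea", "hmtx", "loca", "maxp", "name", "post");

    /** Reads the tags from the directory: a 12-byte header, then 16-byte table records. */
    static Set<String> tableTags(byte[] font) {
        ByteBuffer buf = ByteBuffer.wrap(font);   // big-endian by default, as in TrueType
        buf.getInt();                             // sfnt version
        int numTables = buf.getShort() & 0xFFFF;  // unsigned 16-bit table count
        buf.position(12);                         // skip searchRange, entrySelector, rangeShift
        Set<String> tags = new HashSet<>();
        byte[] tag = new byte[4];
        for (int i = 0; i < numTables; i++) {
            buf.get(tag);
            tags.add(new String(tag, StandardCharsets.US_ASCII));
            buf.position(buf.position() + 12);    // skip checksum, offset, length
        }
        return tags;
    }

    static List<String> missingRequired(byte[] font) {
        Set<String> present = tableTags(font);
        List<String> missing = new ArrayList<>();
        for (String tag : REQUIRED) {
            if (!present.contains(tag)) {
                missing.add(tag);
            }
        }
        return missing;
    }

    /** Builds a directory-only byte array for the demo; this is not a usable font. */
    static byte[] fakeDirectory(String... tags) {
        ByteBuffer buf = ByteBuffer.allocate(12 + 16 * tags.length);
        buf.putInt(0x00010000);
        buf.putShort((short) tags.length);
        buf.position(12);
        for (String t : tags) {
            buf.put(t.getBytes(StandardCharsets.US_ASCII));
            buf.position(buf.position() + 12);
        }
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] stripped = fakeDirectory(
                "cmap", "glyf", "head", "hhea", "hmtx", "loca", "maxp", "post");
        System.out.println(missingRequired(stripped)); // [name]
    }
}
```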
On the other hand, a number of PDF generators cut embedded TrueType fonts down to the bare essentials required by PDF viewers (foremost Adobe Reader), and the name table does not seem to be strictly required by them.
Furthermore, PDF Clown uses the name table for one purpose only: to determine the name of the font in question, even though the font name could also be determined from the BaseFont entry of the associated font dictionary. In fact, the latter entry is required by the PDF specification, while the PostScript name entry in the name table is optional according to the TTF manual.
Thus, using the BaseFont entry in the PDF font dictionary would be a better alternative to this name table access.
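Such a fallback could be sketched roughly as follows. This is not how PDF Clown works; it is a hypothetical, library-independent helper that prefers the name from the name table and otherwise uses the BaseFont value, with the subset tag (six uppercase letters and a plus sign) stripped.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not PDF Clown code: picks a font name with a BaseFont fallback. */
final class FontNameFallback {

    // Subset fonts carry a tag of six uppercase letters followed by '+'.
    private static final Pattern SUBSET_TAG = Pattern.compile("^[A-Z]{6}\\+");

    /**
     * @param nameFromNameTable name read from the 'name' table, or null if the table is missing
     * @param baseFont          value of the BaseFont entry of the PDF font dictionary
     */
    static String resolve(String nameFromNameTable, String baseFont) {
        if (nameFromNameTable != null && !nameFromNameTable.isEmpty()) {
            return nameFromNameTable;
        }
        if (baseFont == null) {
            return null;
        }
        return SUBSET_TAG.matcher(baseFont).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, "ABCDEF+ArialMT"));
        System.out.println(resolve("Arial", "ABCDEF+ArialMT"));
    }
}
```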
Fixing it
Is there anything I can do to fix this?
You can either fix the not entirely valid PDF by adding a name table to the embedded TTF in question, or you can patch PDF Clown to ignore the missing name table: in the class org.pdfclown.documents.contents.fonts.OpenFontParser, edit the method getName:
private String getName(
int id
) throws EOFException, UnsupportedEncodingException
{
// Naming Table ('name' table).
Integer tableOffset = tableOffsets.get("name");
if(tableOffset == null)
throw new ParseException("'name' table does NOT exist.");
Replace that throw new ParseException("'name' table does NOT exist.") by return null.
PS
While the problem could be analyzed using merely the information given by the OP, the sample file provided by @akarshad in his now deleted answer gave more motivation to start the analysis at all.
