Why is a PDF containing only one field around 500 KB?


Here you can download a PDF with a single AcroForm field; its size is exactly 427 KB.
If I remove this one field, the file is only 3 KB. Why does this happen?
I tried analysing it with PDF Debugger and nothing looked odd to me.

There's an embedded "Arial" font in the AcroForm default resources, see Root/AcroForm/DR/Font/Arial/FontDescriptor/FontFile2.
Either you or whoever created the PDF added it for no reason: the font is not used or referenced. For fonts in the AcroForm default resources, you can check whether the /DA entry (default appearance) of any field contains the font name.
When you removed the field, you somehow also removed the font from the AcroForm default resources. (You didn't write how you removed it.)
Here's some code to remove the unused fonts (null checks mostly missing):
PDAcroForm acroForm = doc.getDocumentCatalog().getAcroForm();
PDResources defaultResources = acroForm.getDefaultResources();
COSDictionary fontDict = (COSDictionary) defaultResources.getCOSObject().getDictionaryObject(COSName.FONT);
List<String> defaultAppearances = new ArrayList<>();
List<COSName> fontDeletionList = new ArrayList<>();
// collect the /DA strings of all variable text fields
for (PDField field : acroForm.getFieldTree())
{
    if (field instanceof PDVariableText)
    {
        PDVariableText vtField = (PDVariableText) field;
        defaultAppearances.add(vtField.getDefaultAppearance());
    }
}
// mark every default resources font that no /DA string mentions
for (COSName fontName : defaultResources.getFontNames())
{
    if (COSName.HELV.equals(fontName) || COSName.ZA_DB.equals(fontName))
    {
        // Adobe default, always keep
        continue;
    }
    boolean found = false;
    for (String da : defaultAppearances)
    {
        if (da != null && da.contains("/" + fontName.getName()))
        {
            found = true;
            break;
        }
    }
    System.out.println(fontName + ": " + found);
    if (!found)
    {
        fontDeletionList.add(fontName);
    }
}
System.out.println("deletion list: " + fontDeletionList);
for (COSName fontName : fontDeletionList)
{
    fontDict.removeItem(fontName);
}
The resulting file is now 5 KB.
I haven't checked the annotations. Some of them also have a /DA string, but it is unclear whether the AcroForm default resources fonts are to be used when reconstructing a missing appearance stream.
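One caveat with the plain contains() check above: a /DA string such as "/ArialMT 10 Tf" also contains "/Arial", so a font can be reported as used when only a longer name is. As a rough sketch (JDK only, no PDFBox; the class and method names are made up for illustration, and it assumes the /DA string is whitespace-separated and the name has no #-escapes), the font name can be taken from the operand two tokens before Tf and compared exactly:

```java
import java.util.Optional;

final class DefaultAppearanceFont {

    /**
     * Returns the font resource name selected by the Tf operator of a /DA string,
     * e.g. "Helv" for "/Helv 12 Tf 0 g". Empty if there is no Tf with a name operand.
     */
    static Optional<String> fontNameOf(String da) {
        if (da == null) {
            return Optional.empty();
        }
        String[] tokens = da.trim().split("\\s+");
        // Tf takes two operands: the font name and the font size
        for (int i = 2; i < tokens.length; i++) {
            if ("Tf".equals(tokens[i]) && tokens[i - 2].startsWith("/")) {
                return Optional.of(tokens[i - 2].substring(1));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(fontNameOf("/ArialMT 10 Tf 0 g"));
        System.out.println(fontNameOf("0 g /Helv 0 Tf"));
    }
}
```

In the loop above, the test would then become an equality check between this name and fontName.getName() instead of a substring match.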
Update:
Here's some additional code to replace Arial with Helv:
for (PDField field : acroForm.getFieldTree())
{
    if (field instanceof PDVariableText)
    {
        PDVariableText vtField = (PDVariableText) field;
        String defaultAppearance = vtField.getDefaultAppearance();
        // the trailing space keeps longer names such as /ArialMT from matching
        if (defaultAppearance != null && defaultAppearance.startsWith("/Arial "))
        {
            vtField.setDefaultAppearance("/Helv " + defaultAppearance.substring(7));
            vtField.getWidgets().get(0).setAppearance(null); // this removes the font usage
            vtField.setValue(vtField.getValueAsString());
        }
        defaultAppearances.add(vtField.getDefaultAppearance());
    }
}
Note that this may not be a good idea, because the standard 14 fonts support only a limited set of characters. Try
vtField.setValue("Ayşe");
and you'll get an exception.
More general code to replace font can be found in this answer.
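To get a feeling for which values are at risk before switching a field to Helv, here is a small JDK-only sketch. It rests on an assumption: it uses the windows-1252 charset as a stand-in for the PDF WinAnsiEncoding, which is close but not identical, so treat the result as a hint rather than a guarantee. The class and method names are made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

final class WinAnsiCheck {

    /**
     * Rough pre-check: true if every character of the text can be encoded in
     * windows-1252, used here as an approximation of WinAnsiEncoding.
     */
    static boolean fitsWinAnsi(String text) {
        CharsetEncoder encoder = Charset.forName("windows-1252").newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(fitsWinAnsi("Ayse"));       // plain ASCII
        System.out.println(fitsWinAnsi("Ay\u015Fe"));  // contains s with cedilla (U+015F)
    }
}
```

A field whose value fails this check is probably better left with its embedded font.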

Related

PDFBox in Java: editing a PDF with existing AcroForm fields, but values are hidden until I click on them

I am using PDFBox to process a document that was generated in NestJS with pdf-lib (the fields are created via form.createTextField(field.id)). The document is then sent to Java so that I can put a signature box on top of it and fill in the form fields. The fields are filled and everything works in the PDF.js viewer: I can see the fields and the values.
But when I open the PDF in Google Chrome I don't see the values at all, and when I open it in Adobe Reader I don't see the values until I click on a field.
Here is my Java code:
public void prepareForSigning(DigestAlgorithm digestAlgorithm,
SignatureType signatureType,
UserData userData, List<FieldInput> formFields) throws IOException, NoSuchAlgorithmException {
this.digestAlgorithm = digestAlgorithm;
id = Utils.generateDocumentId();
pdDocument = PDDocument.load(contentIn);
int accessPermissions = getDocumentPermissions();
if (accessPermissions == 1) {
throw new AisClientException("Cannot sign document [" + name + "]. Document contains a certification " +
"that does not allow any changes.");
}
// add fields
// get the document catalog
try {
PDAcroForm acroForm = pdDocument.getDocumentCatalog().getAcroForm();
acroForm.setSignaturesExist(true);
acroForm.setAppendOnly(true);
acroForm.getCOSObject().setDirect(true);
acroForm.getCOSObject().setNeedToBeUpdated(true);
// acroForm.setNeedAppearances(true);
COSObject pdfFields = acroForm.getCOSObject().getCOSObject(COSName.FIELDS);
if (pdfFields != null) {
pdfFields.setNeedToBeUpdated(true);
}
for (int i = 0; i < formFields.size(); i++) {
PDField field = acroForm.getField(formFields.get(i).id);
if (field != null) {
// will also set a checkbox if the value is Yes
// checking for formFields.get(i).value == "true" returns
if ("Btn".equals(field.getFieldType()) && formFields.get(i).value.equals("true")) { // compare strings with equals, not ==
field.setValue("Yes");
} else {
field.setValue(formFields.get(i).value);
}
field.setReadOnly(true);
field.getCOSObject().setNeedToBeUpdated(true);
field.getWidgets().get(0).getAppearance().getCOSObject().setNeedToBeUpdated(true);
Log.info("set field: " + field.getFullyQualifiedName() + " to " + formFields.get(i).value);
}
}
pdDocument.getDocumentCatalog().getCOSObject().setNeedToBeUpdated(true);
} catch (Exception e) {
Log.warn(e);
}
PDSignature pdSignature = new PDSignature();
Calendar signDate = Calendar.getInstance();
if (signatureType == SignatureType.TIMESTAMP) {
// Now, according to ETSI TS 102 778-4, annex A.2, the type of a Dictionary that
// holds document timestamp should be DocTimeStamp
// However, adding this (as of Feb/17/2021), it trips the ETSI Conformance
// Checked online tool, making it say
// "There is no signature dictionary in the document". So, for now (Feb/17/2021)
// this has been removed. This makes the
// ETSI Conformance Checker happy.
// pdSignature.setType(COSName.DOC_TIME_STAMP);
pdSignature.setFilter(PDSignature.FILTER_ADOBE_PPKLITE);
pdSignature.setSubFilter(COSName.getPDFName("ETSI.RFC3161"));
} else {
pdSignature.setFilter(PDSignature.FILTER_ADOBE_PPKLITE);
pdSignature.setSubFilter(PDSignature.SUBFILTER_ETSI_CADES_DETACHED);
// Add 3 Minutes to move signing time within the OnDemand Certificate Validity
// This is only relevant in case the signature does not include a timestamp
// See section 5.8.5.1 of the Reference Guide
signDate.add(Calendar.MINUTE, 3);
}
pdSignature.setSignDate(signDate);
pdSignature.setName(userData.getSignatureName());
pdSignature.setReason(userData.getSignatureReason());
pdSignature.setLocation(userData.getSignatureLocation());
pdSignature.setContactInfo(userData.getSignatureContactInfo());
SignatureOptions options = new SignatureOptions();
options.setPreferredSignatureSize(signatureType.getEstimatedSignatureSizeInBytes());
// create a visible signature at the specified coordinates
if (signatureDefinition != null) {
Rectangle2D humanRect = new Rectangle2D.Float(signatureDefinition.getX(),
signatureDefinition.getY(),
signatureDefinition.getWidth(),
signatureDefinition.getHeight());
PDRectangle rect = createSignatureRectangle(pdDocument, humanRect);
options.setVisualSignature(
createVisualSignatureTemplate(pdDocument, signatureDefinition.getPage(),
signatureDefinition.getImage(), rect, pdSignature));
options.setPage(signatureDefinition.getPage());
}
pdDocument.addSignature(pdSignature, options);
// Set this signature's access permissions level to 0, to ensure we just sign
// the PDF, not certify it
// for more details:
// https://wwwimages2.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf
// see section 12.7.4.5
setPermissionsForSignatureOnly();
pbSigningSupport = pdDocument.saveIncrementalForExternalSigning(inMemoryStream);
MessageDigest digest = MessageDigest.getInstance(digestAlgorithm.getDigestAlgorithm());
byte[] contentToSign = IOUtils.toByteArray(pbSigningSupport.getContent());
byte[] hashToSign = digest.digest(contentToSign);
options.close();
base64HashToSign = Base64.getEncoder().encodeToString(hashToSign);
}
The field with value 5 is visible only because I already clicked on it, so it has focus (screenshot from Adobe Reader).
When I set acroForm.setNeedAppearances(true) I can see the values, but then the signature field is missing. Am I missing something in my code?
I expect the values to be visible in Google Chrome and Adobe Reader without having to click on the fields.
Picture of the PDF fields without values, with one field focused
PDF sample file

How to add text outlines to text within Powerpoint via Apache POI:

Does anyone have an idea how we can add outlines to text (text outline) within PowerPoint templates (pptx) using Apache POI? What I have gathered so far is that the XSLFTextRun class does not have a method to get or set the text outline for a given run element.
As such, I could only persist the following font and text styles:
def fontStyles(textBox: XSLFTextBox, textRun: XSLFTextRun): Unit = {
val fontFamily = textRun.getFontFamily
val fontColor = textRun.getFontColor
val fontSize = textRun.getFontSize
val fontBold = textRun.isBold
val fontItalic = textRun.isItalic
val textAlign = textRun.getParagraph.getTextAlign
textBox.getTextParagraphs.foreach { p =>
p.getTextRuns.foreach { tr =>
tr.setFontFamily(fontFamily)
tr.setFontColor(fontColor)
tr.setFontSize(fontSize)
tr.setBold(fontBold)
tr.setItalic(fontItalic)
tr.getParagraph.setTextAlign(textAlign)
}
}
}
Is it possible to add text outline?
Any assistance/ suggestions would be highly appreciated.

Apache POI uses the underlying ooxml-schemas classes. These are auto-generated from the Office Open XML standard, so they are more complete than the high-level XSLF classes. Of course they are also much less convenient.
So if something is not implemented in the high-level XSLF classes, we can get the underlying CT classes and do it with those. In the case of XSLFTextRun we can get the CTRegularTextRun object. Then we check whether run properties exist already; if not, we add them. Then we check whether an outline is set already; if so, we unset it, because we want to set it anew. Then we set the new outline, which is simply a line with a particular color. That line is represented by a CTLineProperties object. So we need methods to create the CTLineProperties, to set CTLineProperties on an XSLFTextRun, and to get CTLineProperties from an XSLFTextRun.
Complete example in Java:
import java.io.FileOutputStream;
import java.io.FileInputStream;
import org.apache.poi.xslf.usermodel.*;
import org.apache.poi.sl.usermodel.*;
import java.awt.Rectangle;
public class PPTXTextRunOutline {
static org.openxmlformats.schemas.drawingml.x2006.main.CTLineProperties createSolidFillLineProperties(java.awt.Color color) {
// create new CTLineProperties
org.openxmlformats.schemas.drawingml.x2006.main.CTLineProperties lineProperties
= org.openxmlformats.schemas.drawingml.x2006.main.CTLineProperties.Factory.newInstance();
// set line solid fill color
lineProperties.addNewSolidFill().addNewSrgbClr().setVal(new byte[]{(byte)color.getRed(), (byte)color.getGreen(), (byte)color.getBlue()});
return lineProperties;
}
static void setOutline(XSLFTextRun run, org.openxmlformats.schemas.drawingml.x2006.main.CTLineProperties lineProperties) {
// get underlying CTRegularTextRun object
org.openxmlformats.schemas.drawingml.x2006.main.CTRegularTextRun ctRegularTextRun
= (org.openxmlformats.schemas.drawingml.x2006.main.CTRegularTextRun)run.getXmlObject();
// Are there run properties already? If not, add one.
if (ctRegularTextRun.getRPr() == null) ctRegularTextRun.addNewRPr();
// Is there outline set already? If so, unset it, because we are creating it new.
if (ctRegularTextRun.getRPr().isSetLn()) ctRegularTextRun.getRPr().unsetLn();
// set a new outline
ctRegularTextRun.getRPr().setLn(lineProperties);
}
static org.openxmlformats.schemas.drawingml.x2006.main.CTLineProperties getOutline(XSLFTextRun run) {
// get underlying CTRegularTextRun object
org.openxmlformats.schemas.drawingml.x2006.main.CTRegularTextRun ctRegularTextRun
= (org.openxmlformats.schemas.drawingml.x2006.main.CTRegularTextRun)run.getXmlObject();
// Are there run properties already? If not, return null.
if (ctRegularTextRun.getRPr() == null) return null;
// get outline, may be null
org.openxmlformats.schemas.drawingml.x2006.main.CTLineProperties lineProperties = ctRegularTextRun.getRPr().getLn();
// make a copy to avoid orphaned exceptions or value disconnected exception when set to its own XML parent
if (lineProperties != null) lineProperties = (org.openxmlformats.schemas.drawingml.x2006.main.CTLineProperties)lineProperties.copy();
return lineProperties;
}
// your method fontStyles taken to Java code
static void fontStyles(XSLFTextRun templateRun, XSLFTextShape textShape) {
String fontFamily = templateRun.getFontFamily();
PaintStyle fontColor = templateRun.getFontColor();
Double fontSize = templateRun.getFontSize();
boolean fontBold = templateRun.isBold();
boolean fontItalic = templateRun.isItalic();
TextParagraph.TextAlign textAlign = templateRun.getParagraph().getTextAlign();
org.openxmlformats.schemas.drawingml.x2006.main.CTLineProperties lineProperties = getOutline(templateRun);
for (XSLFTextParagraph paragraph : textShape.getTextParagraphs()) {
for (XSLFTextRun run : paragraph.getTextRuns()) {
run.setFontFamily(fontFamily);
if(run != templateRun) run.setFontColor(fontColor); // set PaintStyle has the issue which I am avoiding by using a copy of the underlying XML
run.setFontSize(fontSize);
run.setBold(fontBold);
run.setItalic(fontItalic);
run.getParagraph().setTextAlign(textAlign);
setOutline(run, lineProperties);
}
}
}
public static void main(String[] args) throws Exception {
XMLSlideShow slideShow = new XMLSlideShow(new FileInputStream("./PPTXIn.pptx"));
XSLFSlide slide = slideShow.getSlides().get(0);
//as in your code, get a template text run and set its font style to all other runs in text shape
if (slide.getShapes().size() > 0) {
XSLFShape shape = slide.getShapes().get(0);
if (shape instanceof XSLFTextShape) {
XSLFTextShape textShape = (XSLFTextShape) shape;
XSLFTextParagraph paragraph = null;
if(textShape.getTextParagraphs().size() > 0) paragraph = textShape.getTextParagraphs().get(0);
if (paragraph != null) {
XSLFTextRun run = null;
if(paragraph.getTextRuns().size() > 0) run = paragraph.getTextRuns().get(0);
if (run != null) {
fontStyles(run, textShape);
}
}
}
}
//new text box having outlined text from scratch
XSLFTextBox textbox = slide.createTextBox();
textbox.setAnchor(new Rectangle(100, 300, 570, 80));
XSLFTextParagraph paragraph = null;
if(textbox.getTextParagraphs().size() > 0) paragraph = textbox.getTextParagraphs().get(0);
if(paragraph == null) paragraph = textbox.addNewTextParagraph();
XSLFTextRun run = paragraph.addNewTextRun();
run.setText("Test text outline");
run.setFontSize(60d);
run.setFontColor(java.awt.Color.YELLOW);
setOutline(run, createSolidFillLineProperties(java.awt.Color.BLUE));
FileOutputStream out = new FileOutputStream("./PPTXOut.pptx");
slideShow.write(out);
out.close();
}
}
Tested and works using current apache poi 5.0.0.

Add FormXobject content from resources to content stream using PDFBox?

I have form XObjects under page1 -> Resources -> XObject -> Fm0, Fm1, Fm2, ...
Their content is therefore not part of the page's own content stream (page1 -> Contents). I want to move the content stream of Fm0 into the page content stream.
When the content stream is moved like this, the resources that Fm0 uses also have to be transferred or copied to the page-level resources:
1. The content stream needs to be copied into the page-level contents.
2. Color space objects need to be copied to page1 -> Resources -> ColorSpace.
3. ExtGState objects need to be copied to page1 -> Resources -> ExtGState.
4. Properties need to be copied to page1 -> Resources -> Properties (this entry has to be created from scratch).
I tried the following code:
private static PDDocument parseFormXobject(PDDocument document, Integer pg_ind) throws IOException {
List<Object> tokens1 = (List<Object>) (getTokens(document, pg_ind)).get(pg_ind);
PDStream newContents = new PDStream(document);
OutputStream out = newContents.createOutputStream(COSName.FLATE_DECODE);
ContentStreamWriter writer = new ContentStreamWriter(out);
PDPage pageinner = document.getPage(pg_ind);
PDResources resources = pageinner.getResources();
PDResources new_resources = new PDResources();
new_resources = resources;
COSDictionary fntdict = new COSDictionary();
COSDictionary imgdict = new COSDictionary();
COSDictionary extgsdict = new COSDictionary();
COSDictionary colordict = new COSDictionary();
COSDictionary pattern = new COSDictionary();
int img_count = 0;
for (COSName xObjectName : resources.getXObjectNames()) {
PDXObject xObject = resources.getXObject(xObjectName);
if (xObject instanceof PDFormXObject
&& tokens1.toString().contains(xObjectName.toString()) ) {
PDFStreamParser parser = new PDFStreamParser(((PDFormXObject) xObject).getContentStream());
parser.parse();
List<Object> tokens3 = parser.getTokens();
int ind =0;
//isTextContains will check is there any Tj operators or there or not
if (isTextContains(tokens3)){
for (COSName colorname :((PDFormXObject) xObject).getResources().getColorSpaceNames())
{
COSName new_name = COSName.getPDFName(colorname.getName());
PDColorSpace pdcolor = ((PDFormXObject) xObject).getResources().getColorSpace(colorname);
colordict.setItem(new_name,pdcolor);
}
for (COSName fontName :((PDFormXObject) xObject).getResources().getFontNames() )
{
COSName new_name = COSName.getPDFName(fontName.getName());
PDFont font =((PDFormXObject) xObject).getResources().getFont(fontName);
font.getCOSObject().setItem(COSName.NAME, new_name);
fntdict.setItem(new_name,font);
}
for (COSName ExtGSName :((PDFormXObject) xObject).getResources().getExtGStateNames() )
{
COSName new_name = COSName.getPDFName(ExtGSName.getName());
PDExtendedGraphicsState ExtGState =((PDFormXObject) xObject).getResources().getExtGState(ExtGSName);
ExtGState.getCOSObject().setItem(COSName.NAME, new_name);
extgsdict.setItem(new_name,ExtGState);
}
imgdict.setItem(xObjectName, xObject);
for (COSName Imgname :((PDFormXObject) xObject).getResources().getXObjectNames() )
{
COSName new_name = COSName.getPDFName(Imgname.getName());
xObject.getCOSObject().setItem(COSName.NAME, new_name);
PDXObject img =((PDFormXObject) xObject).getResources().getXObject(Imgname);
imgdict.setItem(new_name, img);
}
for (COSName paternname :((PDFormXObject) xObject).getResources().getPatternNames() )
{
COSName new_name = COSName.getPDFName(paternname.getName());
PDAbstractPattern pat = ((PDFormXObject) xObject).getResources().getPattern(paternname);
pat.getCOSObject().setItem(COSName.NAME, new_name);
pattern.setItem(new_name,pat);
}
for (int k=0; k< tokens1.size(); k++) {
if ( ((tokens1.get(k) instanceof Operator) && ((Operator)tokens1.get(k)).getName().toString().equals("Do"))
&& ((COSName)tokens1.get(k-1)).getName().toString().equals(xObjectName.getName().toString()) ) {
tokens1.remove(k-1);
tokens1.remove(k-1);
tokens1.add(k-1, Operator.getOperator("q"));
if(((PDFormXObject) xObject).getMatrix() != null) {
tokens1.add(k, new COSFloat(((PDFormXObject) xObject).getMatrix().getScaleX()));
tokens1.add(k + 1, new COSFloat(((PDFormXObject) xObject).getMatrix().getShearY()));
tokens1.add(k + 2, new COSFloat(((PDFormXObject) xObject).getMatrix().getShearX()));
tokens1.add(k + 3, new COSFloat(((PDFormXObject) xObject).getMatrix().getScaleY()));
tokens1.add(k + 4, new COSFloat(((PDFormXObject) xObject).getMatrix().getTranslateX()));
tokens1.add(k + 5, new COSFloat(((PDFormXObject) xObject).getMatrix().getTranslateY()));
tokens1.add(k + 6, Operator.getOperator("cm"));
tokens1.add(k+7, Operator.getOperator("Q"));
ind =k+7;
}else{
tokens1.add(k, Operator.getOperator("Q"));
ind =k;
}
break;
}
}
for (int k=0; k< tokens3.size(); k++) {
if ( (tokens3.size() > k+1) && (tokens3.get(k+1) instanceof Operator) && (((Operator)tokens3.get(k+1)).getName().toString().equals("Do")
|| ((Operator)tokens3.get(k+1)).getName().toString().equals("gs")
|| ((Operator)tokens3.get(k+1)).getName().toString().equals("cs")
|| ((Operator)tokens3.get(k+1)).getName().toString().equals("CS")) ) {
COSName new_name = COSName.getPDFName( ((COSName) tokens3.get(k)).getName() );
tokens1.add(ind+k, new_name );
}else if ( (tokens3.size() > k+2) && (tokens3.get(k+2) instanceof Operator)
&& ((Operator)tokens3.get(k+2)).getName().toString().equals("Tf") ) {
COSName new_name = COSName.getPDFName( ((COSName) tokens3.get(k)).getName() );
tokens1.add(ind+k, new_name );
}
else
tokens1.add(ind+k,tokens3.get(k));
}
img_count +=1;
}else {
imgdict.setItem(xObjectName, xObject);
img_count +=1;
}
}else
imgdict.setItem(xObjectName, xObject);
}
for (COSName fontName :new_resources.getFontNames() )
{
PDFont font =new_resources.getFont(fontName);
fntdict.setItem(fontName,font);
}
for (COSName ExtGSName :new_resources.getExtGStateNames() )
{
PDExtendedGraphicsState extg =new_resources.getExtGState(ExtGSName);
extgsdict.setItem(ExtGSName,extg);
}
for (COSName colorname :new_resources.getColorSpaceNames() )
{
PDColorSpace color =new_resources.getColorSpace(colorname);
colordict.setItem(colorname,color);
}
for (COSName patern :new_resources.getPatternNames() )
{
PDAbstractPattern pat =new_resources.getPattern(patern);
pattern.setItem(patern,pat);
}
resources.getCOSObject().setItem(COSName.EXT_G_STATE,extgsdict);
resources.getCOSObject().setItem(COSName.FONT,fntdict);
resources.getCOSObject().setItem(COSName.XOBJECT,imgdict);
resources.getCOSObject().setItem(COSName.COLORSPACE, colordict);
resources.getCOSObject().setItem(COSName.PATTERN, pattern);
writer.writeTokens(tokens1);
out.close();
document.getPage(pg_ind).setContents(newContents);
document.getPage(pg_ind).setResources(resources);
return document;
}
private static JSONObject getTokens(PDDocument oldDocument, Integer pageIndex) throws IOException {
// TODO Auto- it will return the tokens of pdf
JSONObject oldDocumentTokens = new JSONObject();
PDPage pg = oldDocument.getPage(pageIndex);
PDFStreamParser parser = new PDFStreamParser(pg);
parser.parse();
List<Object> tokens = PDFUtils.removeTokens(parser.getTokens());
oldDocumentTokens.put(pageIndex, tokens);
return oldDocumentTokens;
}
private static boolean isTextContains(List<Object> tokens3) {
for (int k=0; k< tokens3.size(); k++) {
if (tokens3.get(k) instanceof Operator) {
Operator op = (Operator) tokens3.get(k);
if(op.getName().equals("BT"))
return true;
}
}
return false;
}
But I am unable to get the exact page graphics; something gets lost.
input pdf
output pdf

There are multiple issues, some in details, some in the concept.
Wrapping in a save-graphics-state/restore-graphics-state envelope
When you draw an XObject, graphics state changes in that XObject don't change your current graphics state. To make sure this still is true after you copied the XObject instructions into your page content stream, you have to wrap that block into a save-graphics-state/restore-graphics-state envelope (q ... Q). You can do that by adding these two lines
tokens1.add(ind++, Operator.getOperator("q"));
tokens1.add(ind, Operator.getOperator("Q"));
right before your instruction copying loop
for (int k=0; k< tokens3.size(); k++) {
...
}
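As a self-contained illustration of the envelope idea, here is a sketch that models content stream tokens as plain strings instead of the PDFBox COSName/Operator objects (that simplification, and the class and method names, are assumptions made for the example). It replaces one "/Name Do" pair with the form's instructions wrapped in q ... Q:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class EnvelopeSplice {

    /**
     * Returns a copy of the page tokens in which the name operand at doIndex - 1
     * and the Do operator at doIndex are replaced by "q <form tokens> Q".
     */
    static List<String> inline(List<String> page, int doIndex, List<String> form) {
        List<String> out = new ArrayList<>(page.subList(0, doIndex - 1)); // everything before the name operand
        out.add("q");          // save graphics state
        out.addAll(form);      // instructions copied from the form XObject
        out.add("Q");          // restore graphics state
        out.addAll(page.subList(doIndex + 1, page.size())); // everything after Do
        return out;
    }

    public static void main(String[] args) {
        List<String> page = Arrays.asList("1", "0", "0", "rg", "/Fm0", "Do", "0", "0", "10", "10", "re", "f");
        List<String> form = Arrays.asList("0", "1", "0", "rg", "5", "5", "re", "f");
        System.out.println(inline(page, 5, form));
    }
}
```

In the example, the page's red fill color is still in effect after the spliced block, because the form's green fill color is undone by Q.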
Coordinate system
You assume the coordinate system in the XObject equals that of the page. It doesn't necessarily. XObjects may have a Matrix entry denoting the transformation to apply.
Boundary box
You don't limit the area of what is drawn by the XObject instructions. But XObjects have a BBox entry denoting the box to clip the outputs to.
Optional content
XObjects may also have an OC entry denoting their optional content membership. Such a membership needs to be transformed into an equivalent optional content tagging.
Marked content, structure tree
XObjects can also refer to the structural parent tree via their StructParent or StructParents entry. To keep structural integrity of the document, you may have to considerably update the structure tree.
Grouping
XObjects may contain a Group entry indicating that its content shall be treated as a group. In particular in case of Transparency Groups this results in a different behavior of transparency related features than for the same instructions copied into the page content.
Unless you completely analyze the effects of each bit of content drawn with some transparency and from case to case rewrite the instructions drawing it, copying the instructions from the XObject to the page content stream will result in substantial differences in the displayed content.
Usage
Your code assumes that an XObject is used exactly once in the page content streams. This need not be the case: it can also be used several times or not at all.
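A quick way to see how often a form is actually invoked is to collect the positions of all matching Do operators first. The sketch below again models tokens as plain strings rather than PDFBox objects, and its names are invented for the example:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class DoUsage {

    /** Returns the indexes of all Do operators whose operand is /name. */
    static List<Integer> findInvocations(List<String> tokens, String name) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 1; i < tokens.size(); i++) {
            if ("Do".equals(tokens.get(i)) && ("/" + name).equals(tokens.get(i - 1))) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("/Fm0", "Do", "q", "/Fm1", "Do", "Q", "/Fm0", "Do");
        System.out.println("Fm0: " + findInvocations(tokens, "Fm0"));
        System.out.println("Fm2: " + findInvocations(tokens, "Fm2"));
    }
}
```

An empty result means the form is never painted on this page, so nothing should be copied. With several hits, splicing from the last index to the first keeps the earlier indexes valid.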
References
In a comment you asked for references. Actually it's all in the PDF specification ISO 32000, already in the publicly available ISO 32000-1:
8.10 Form XObjects
A form XObject is a PDF content stream that is a self-contained description of any sequence of graphics objects (including path objects, text objects, and sampled images). A form XObject may be painted multiple times—either on several pages or at several locations on the same page—and produces the same results each time, subject only to the graphics state at the time it is invoked.
Thus, any number of usages on a given page is possible
When the Do operator is applied to a form XObject, a conforming reader shall perform the following tasks:
a) Saves the current graphics state, as if by invoking the q operator (see 8.4.4, "Graphics State Operators")
b) Concatenates the matrix from the form dictionary’s Matrix entry with the current transformation matrix (CTM)
c) Clips according to the form dictionary’s BBox entry
d) Paints the graphics objects specified in the form’s content stream
e) Restores the saved graphics state, as if by invoking the Q operator (see 8.4.4, "Graphics State Operators")
When copying into the page content stream, therefore, you should equivalently use a q/Q envelope and respect the Matrix and BBox entries.
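Steps a) to e) can be sketched as a token sequence. This is a JDK-only illustration with tokens modeled as strings; java.awt.geom.AffineTransform and Rectangle2D stand in for the form's Matrix and BBox, and all names are hypothetical. The clip is emitted after cm because the BBox is given in the form's coordinate system.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FormEnvelope {

    /** Builds "q [a b c d e f cm] [x y w h re W n] <form tokens> Q". */
    static List<String> wrap(AffineTransform matrix, Rectangle2D bbox, List<String> form) {
        List<String> out = new ArrayList<>();
        out.add("q");                                        // a) save graphics state
        if (matrix != null && !matrix.isIdentity()) {        // b) concatenate Matrix with the CTM
            double[] m = new double[6];
            matrix.getMatrix(m);                             // m00 m10 m01 m11 m02 m12 = PDF a b c d e f
            for (double v : m) {
                out.add(num(v));
            }
            out.add("cm");
        }
        if (bbox != null) {                                  // c) clip to BBox
            out.addAll(Arrays.asList(num(bbox.getX()), num(bbox.getY()),
                    num(bbox.getWidth()), num(bbox.getHeight()), "re", "W", "n"));
        }
        out.addAll(form);                                    // d) paint the form's content
        out.add("Q");                                        // e) restore graphics state
        return out;
    }

    /** Formats a finite number without exponent or trailing zeros. */
    static String num(double v) {
        return new BigDecimal(String.valueOf(v)).stripTrailingZeros().toPlainString();
    }

    public static void main(String[] args) {
        AffineTransform matrix = new AffineTransform(1, 0, 0, 1, 50, 100);
        Rectangle2D bbox = new Rectangle2D.Double(0, 0, 200, 80);
        System.out.println(wrap(matrix, bbox, Arrays.asList("BT", "ET")));
    }
}
```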
8.11.3.3 Optional Content in XObjects and Annotations
In addition to marked content within content streams, form XObjects and image XObjects (see 8.8, "External Objects") and annotations (see 12.5, "Annotations") may contain an OC entry, which shall be an optional content group or an optional content membership dictionary.
A form or image XObject's visibility shall be determined by the state of the group or those of the groups referenced by the membership dictionary in conjunction with its P (or VE) entry, along with the current visibility state in the context in which the XObject is invoked (that is, whether objects are visible in the contents stream at the place where the Do operation occurred).
Thus, respect this optional content information when copying to the page content.
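In a content stream, the equivalent tagging is a marked-content sequence of the form /OC /name BDC ... EMC, where the name refers to an entry in the Properties subdictionary of the page resources. A minimal sketch with string tokens follows; the property name MC0 and the class and method names are placeholders, and registering the form's OC dictionary under that name in the page's Properties is assumed to happen elsewhere.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OptionalContentWrap {

    /**
     * Wraps the tokens in "/OC /propertyName BDC ... EMC". The property name must
     * map to the optional content dictionary in the page's Properties resources.
     */
    static List<String> wrap(String propertyName, List<String> tokens) {
        List<String> out = new ArrayList<>();
        out.add("/OC");
        out.add("/" + propertyName);
        out.add("BDC");        // begin marked content with property list
        out.addAll(tokens);
        out.add("EMC");        // end marked content
        return out;
    }

    public static void main(String[] args) {
        System.out.println(wrap("MC0", Arrays.asList("q", "BT", "ET", "Q")));
    }
}
```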
11.6.6 Transparency Group XObjects
A transparency group is represented in PDF as a special type of group XObject (see “Group XObjects”) called a transparency group XObject. A group XObject is in turn a type of form XObject, distinguished by the presence of a Group entry in its form dictionary (see “Form Dictionaries”). The value of this entry is a subsidiary group attributes dictionary defining the properties of the group. The format and meaning of the dictionary’s contents shall be determined by its group subtype, which is specified by the dictionary’s S entry. The entries for a transparency group (subtype Transparency) are shown in Table 147.
...
Annex L
So copying from transparency groups may change the appearance substantially.
14.7.4.3 PDF Objects as Content Items
When a structure element’s content includes an entire PDF object, such as an XObject or an annotation, that is associated with a page but not directly included in the page’s content stream, the object shall be identified in the structure element’s K entry by an object reference dictionary (see Table 325).
...
14.7.4.4 Finding Structure Elements from Content Items
...
To locate the relevant parent tree entry, each object or content stream that is represented in the tree shall contain a special dictionary entry, StructParent or StructParents (see Table 326). Depending on the type of content item, this entry may appear in the page object of a page containing marked-content sequences, in the stream dictionary of a form or image XObject, in an annotation dictionary, or in any other type of object dictionary that is included as a content item in a structure element.
This and more information from the same chapter should indicate clearly that structure information after copying from XObject to page content must be overhauled.

This is just snippet code for mkl's answer above; I am trying to give a snippet for each of its points.
Wrapping in a save-graphics-state/restore-graphics-state envelope
When you draw an XObject, graphics state changes in that XObject don't change your current graphics state. To make sure this still is true after you copied the XObject instructions into your page content stream, you have to wrap that block into a save-graphics-state/restore-graphics-state envelope (q ... Q). You can do that by adding these two lines
tokens1.add(ind++, Operator.getOperator("q"));
tokens1.add(ind, Operator.getOperator("Q"));
right before your instruction copying loop
for (int k=0; k< tokens3.size(); k++) {
...
}
Coordinate system:
You assume the coordinate system in the XObject equals that of the page. It doesn't necessarily. XObjects may have a Matrix entry denoting the transformation to apply.
tokens1.add(k-1, Operator.getOperator("q"));
if(((PDFormXObject) xObject).getMatrix() != null) {
tokens1.add(k, new COSFloat(((PDFormXObject) xObject).getMatrix().getScaleX()));
tokens1.add(k + 1, new COSFloat(((PDFormXObject) xObject).getMatrix().getShearY()));
tokens1.add(k + 2, new COSFloat(((PDFormXObject) xObject).getMatrix().getShearX()));
tokens1.add(k + 3, new COSFloat(((PDFormXObject) xObject).getMatrix().getScaleY()));
tokens1.add(k + 4, new COSFloat(((PDFormXObject) xObject).getMatrix().getTranslateX()));
tokens1.add(k + 5, new COSFloat(((PDFormXObject) xObject).getMatrix().getTranslateY()));
tokens1.add(k + 6, Operator.getOperator("cm"));
tokens1.add(k+7, Operator.getOperator("Q"));
ind =k+7;
}
Boundary box:
You don't limit the area of what is drawn by the XObject instructions. But XObjects have a BBox entry denoting the box to clip the outputs to.
if (((PDFormXObject) xObject).getBBox() != null) {
//How can I add this bbox property? is it 're'?
tokens1.add(k, new COSFloat(((PDFormXObject) xObject).getBBox().getLowerLeftX()));
tokens1.add(k+1, new COSFloat(((PDFormXObject) xObject).getBBox().getLowerLeftY()));
tokens1.add(k+2, new COSFloat(((PDFormXObject) xObject).getBBox().getWidth()));
tokens1.add(k+3, new COSFloat(((PDFormXObject) xObject).getBBox().getHeight()));
tokens1.add(k+4, Operator.getOperator("re"));
tokens1.add(k+5, Operator.getOperator("W"));
tokens1.add(k+6, Operator.getOperator("n"));
}
Optional content
XObjects may also have an OC entry denoting their optional content membership. Such a membership needs to be transformed into an equivalent optional content tagging.
//How can I get this oc property from xobject and how can I use it?
Marked content, structure tree
//For now there is no any marked content. Assume every pdf is not Tagged.
Grouping
XObjects may contain a Group entry indicating that its content shall be treated as a group. In particular in case of Transparency Groups this results in a different behavior of transparency related features than for the same instructions copied into the page content.
Unless you completely analyze the effects of each bit of content drawn with some transparency and from case to case rewrite the instructions drawing it, copying the instructions from the XObject to the page content stream will result in substantial differences in the displayed content.
if (((PDFormXObject) xObject).getGroup() != null) {
//if this is not null how to use this?
}
Usage
Your code assumes that a XObject is used exactly once in the page content streams. This need not be the case, it can also be used more often or not at all.
// For this I will iterate over the main content stream and replace every form XObject invocation.
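Before replacing anything, it can help to count how often each XObject is actually invoked. The sketch below does this over a simplified token list of strings; it is an illustration with hypothetical names and does not use the PDFBox parser.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DoUsageSketch {
    /** Counts how often each name operand is directly followed by the Do operator. */
    static Map<String, Integer> countInvocations(List<String> tokens) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 1; i < tokens.size(); i++) {
            String operand = tokens.get(i - 1);
            if ("Do".equals(tokens.get(i)) && operand.startsWith("/")) {
                counts.merge(operand.substring(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "q", "/Fm0", "Do", "Q", "q", "/Fm0", "Do", "Q", "/Im1", "Do");
        System.out.println(countInvocations(tokens)); // {Fm0=2, Im1=1}
    }
}
```

A name that never shows up in the result is not used on the page at all, and a count above one means the inlined content has to be inserted at each invocation.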
References
I am adding a snippet that copies all references.
Reading references
for (COSName colorname :((PDFormXObject) xObject).getResources().getColorSpaceNames())
{
COSName new_name = COSName.getPDFName(colorname.getName());
PDColorSpace pdcolor = ((PDFormXObject) xObject).getResources().getColorSpace(colorname);
colordict.setItem(new_name,pdcolor);
}
for (COSName propertyName :((PDFormXObject) xObject).getResources().getPropertiesNames())
{
COSName new_name = COSName.getPDFName(propertyName.getName()+"_Fm"+img_count);
PDPropertyList property =((PDFormXObject) xObject).getResources().getProperties(propertyName);
property.getCOSObject().setItem(COSName.NAME, new_name);
propertiesdict.setItem(new_name,property);
}
for (COSName shadeName :((PDFormXObject) xObject).getResources().getShadingNames() )
{
COSName new_name = COSName.getPDFName(shadeName.getName()+"_Fm"+img_count);
PDShading shade =((PDFormXObject) xObject).getResources().getShading(shadeName);
shade.getCOSObject().setItem(COSName.NAME, new_name);
fntdict.setItem(new_name,shade);
}
for (COSName fontName :((PDFormXObject) xObject).getResources().getFontNames() )
{
COSName new_name = COSName.getPDFName(fontName.getName());
PDFont font =((PDFormXObject) xObject).getResources().getFont(fontName);
font.getCOSObject().setItem(COSName.NAME, new_name);
fntdict.setItem(new_name,font);
}
for (COSName ExtGSName :((PDFormXObject) xObject).getResources().getExtGStateNames() )
{
COSName new_name = COSName.getPDFName(ExtGSName.getName());
PDExtendedGraphicsState ExtGState =((PDFormXObject) xObject).getResources().getExtGState(ExtGSName);
ExtGState.getCOSObject().setItem(COSName.NAME, new_name);
extgsdict.setItem(new_name,ExtGState);
}
imgdict.setItem(xObjectName, xObject);
for (COSName Imgname :((PDFormXObject) xObject).getResources().getXObjectNames() )
{
COSName new_name = COSName.getPDFName(Imgname.getName());
xObject.getCOSObject().setItem(COSName.NAME, new_name);
PDXObject img =((PDFormXObject) xObject).getResources().getXObject(Imgname);
imgdict.setItem(new_name, img);
}
for (COSName paternname :((PDFormXObject) xObject).getResources().getPatternNames() )
{
COSName new_name = COSName.getPDFName(paternname.getName());
PDAbstractPattern pat = ((PDFormXObject) xObject).getResources().getPattern(paternname);
pat.getCOSObject().setItem(COSName.NAME, new_name);
pattern.setItem(new_name,pat);
}
// Later I put everything in place:
for (COSName fontName :new_resources.getFontNames() )
{
PDFont font =new_resources.getFont(fontName);
fntdict.setItem(fontName,font);
}
for (COSName ExtGSName :new_resources.getExtGStateNames() )
{
PDExtendedGraphicsState extg =new_resources.getExtGState(ExtGSName);
extgsdict.setItem(ExtGSName,extg);
}
for (COSName colorname :new_resources.getColorSpaceNames() )
{
PDColorSpace color =new_resources.getColorSpace(colorname);
colordict.setItem(colorname,color);
}
for (COSName patern :new_resources.getPatternNames() )
{
PDAbstractPattern pat =new_resources.getPattern(patern);
pattern.setItem(patern,pat);
}
resources.getCOSObject().setItem(COSName.EXT_G_STATE,extgsdict);
resources.getCOSObject().setItem(COSName.FONT,fntdict);
resources.getCOSObject().setItem(COSName.XOBJECT,imgdict);
resources.getCOSObject().setItem(COSName.COLORSPACE, colordict);
resources.getCOSObject().setItem(COSName.PATTERN, pattern);
resources.getCOSObject().setItem(COSName.PROPERTIES, propertiesdict);
// What about shading? How can I add it here?
writer.writeTokens(tokens1);
out.close();
document.getPage(pg_ind).setContents(newContents);
document.getPage(pg_ind).setResources(resources);
@mkl, please correct me here so that I can arrive at a complete solution. I will try hard to make this work. Thanks in advance.
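One point worth checking in the snippet above: the font, ExtGState, color space, XObject and pattern loops reuse the form's resource names unchanged, while properties and shadings get an "_Fm" suffix. If the page already uses one of the unchanged names for a different object, the copy overwrites it. A minimal sketch of a collision-safe renaming plan, using plain sets of names rather than PDFBox objects (all names here are hypothetical):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

final class ResourceRenameSketch {
    /**
     * Maps each form resource name to the name it should get on the page.
     * A name already taken on the page gets the suffix appended until unique.
     */
    static Map<String, String> planRenames(Set<String> pageNames,
                                           Set<String> formNames, String suffix) {
        if (suffix.isEmpty()) {
            throw new IllegalArgumentException("suffix must not be empty");
        }
        Set<String> taken = new HashSet<>(pageNames);
        Map<String, String> renames = new LinkedHashMap<>();
        for (String name : formNames) {
            String candidate = name;
            while (taken.contains(candidate)) {
                candidate = candidate + suffix;
            }
            taken.add(candidate);
            renames.put(name, candidate);
        }
        return renames;
    }

    public static void main(String[] args) {
        Set<String> page = new HashSet<>(Arrays.asList("F1", "GS0"));
        Set<String> form = new LinkedHashSet<>(Arrays.asList("F1", "F2"));
        System.out.println(planRenames(page, form, "_Fm0")); // {F1=F1_Fm0, F2=F2}
    }
}
```

Whatever renaming is chosen, the operands in the copied form content (for example the font name before `Tf`) have to be rewritten to the new names as well.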

PDFBox invalid option in radio

When I try to fill the form in this PDF (http://vaielab.com/Test/2.pdf) with this code:
PDDocument pdfDocument = PDDocument.load(new File("2.pdf"));
pdfDocument.setAllSecurityToBeRemoved(true);
PDDocumentCatalog docCatalog = pdfDocument.getDocumentCatalog();
PDAcroForm acroForm = docCatalog.getAcroForm();
if (acroForm != null) {
PDField field = (PDField) acroForm.getField("rad2");
try {
field.setValue("0");
} catch (Exception e) {
System.out.println(e);
}
}
pdfDocument.save("output.pdf");
pdfDocument.close();
I get this error: value '0' is not a valid option for the field rad2, valid values are: [Yes] and Off
But value "0" should be a valid option, and if I do a dump_data_fields with pdftk, I get this:
FieldType: Button
FieldName: rad2
FieldFlags: 49152
FieldJustification: Left
FieldStateOption: 0
FieldStateOption: 1
FieldStateOption: Off
FieldStateOption: Yes
I also tried the value "1" but got exactly the same error.
I'm using PDFBox 2.0.20.

This is because of the Opt values in Root/AcroForm/Fields/[7]/Opt: that array contains only two "Yes" entries. The PDButton.setValue() code in PDFBox updates the field differently when /Opt is set. The best fix here would be not to set /Opt, or to remove these entries by calling field.setExportValues(null). The valid settings would then be "0", "1" and "Off".
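To see why the error message lists only [Yes], here is a simplified model of the validation. It is not PDFBox's actual code; it assumes, as the error message suggests, that the allowed "on" values are collected into a set, taken from /Opt when that entry is present and from the widget appearance state names otherwise. All names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class RadioValueModel {
    /** Assumed rule: export values win when /Opt is present and non-empty. */
    static Set<String> allowedOnValues(List<String> opt, List<String> appearanceStates) {
        Set<String> on = new LinkedHashSet<>();
        if (opt != null && !opt.isEmpty()) {
            on.addAll(opt); // duplicates such as "Yes", "Yes" collapse to one entry
        } else {
            for (String state : appearanceStates) {
                if (!"Off".equals(state)) {
                    on.add(state);
                }
            }
        }
        return on;
    }

    public static void main(String[] args) {
        List<String> states = Arrays.asList("0", "1", "Off");
        System.out.println(allowedOnValues(Arrays.asList("Yes", "Yes"), states)); // [Yes]
        System.out.println(allowedOnValues(null, states));                        // [0, 1]
    }
}
```

Under this model, removing /Opt is what makes "0" and "1" acceptable again, which matches the behavior described above.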

How to substitute missing font when filling a form with PDFBox?

I'm trying to fill out a bunch of PDF Forms using PDFBox 2.0.8. For some documents I get the following error when setting the PDTextField's value:
java.io.IOException: Could not find font: /ArialMT
Apparently the font is not correctly embedded, as is often the case with proprietary Microsoft fonts.
How can I tell PDFBox to substitute the font, e.g. with "normal" Arial or some other font? Setting the field's DA string to "/Helv 0 tf 0 g" resulted in a NullPointerException.

Based on the comments from Tilman Hausherr, I built a first fix that works independently of the operating system (Linux in my case).
acroForm.defaultResources.put(COSName.getPDFName("ArialMT"),
PDType0Font.load (pdDocument, this.javaClass.classLoader.getResourceAsStream("fonts/ARIALMT.ttf"), false))
This will only work for this particular font, though. What's still missing, and was actually the main intention of my question, is an option to tell PDFBox to fall back to a certain font or DA if the required font cannot be provided.
After Tilman again came to the rescue, I can now present the complete solution. Again, this is Kotlin, not Java:
PDDocument.load(file).use { pdDocument ->
val acroForm = pdDocument.documentCatalog.acroForm
acroForm.defaultResources.put(COSName.getPDFName("ArialMT"),
PDType0Font.load (pdDocument, this.javaClass.classLoader.getResourceAsStream("fonts/ARIALMT.ttf"), false))
val pdField: PDField? = acroForm.getField(fieldname)
val value = ...
when (pdField) {
is PDCheckBox -> {
if (value is Boolean) {
when (value) {
true -> pdField.check()
false -> pdField.unCheck()
}
} else {
log.error("RENDER_FORM: Need Boolean for ${pdField.fullyQualifiedName} but got $value")
}
}
is PDTextField -> {
try {
pdField.value = value?.toString() ?: ""
} catch (ioException: IOException) {
pdField.cosObject.setString(COSName.DA, "/Helv 0 Tf 0 g")
pdField.value = value?.toString() ?: ""
log.error("RENDER_FORM: Writing text field failed: ${ioException.message}")
}
}
null -> {
log.error("RENDER_FORMULAR: Formfield $fieldname does not exist in $name")
}
else -> log.error("RENDER_FORMULAR: Formfield $pdField ($fieldname) is of unhandled type ${pdField.fieldType}")
}
val stream = ByteArrayOutputStream()
pdDocument.save(stream)
pdDocument.close()
return stream.toByteArray()
}

Add "ArialMT" to the default resources:
try (PDDocument doc = PDDocument.load(new File("F2_Datenblatt_022015.pdf")))
{
PDAcroForm acroForm = doc.getDocumentCatalog().getAcroForm();
PDField field = acroForm.getField("Vorname_Name");
// fails with IOException as described in question
//field.setValue("Tilman Hausherr");
// Method 1, just add type1 Helvetica (allows only WinAnsiEncoding glyphs)
//acroForm.getDefaultResources().put(COSName.getPDFName("ArialMT"), PDType1Font.HELVETICA);
// Method 2, add the full Arial font (allows for more different glyphs)
// important: use the method that switches off subsetting
acroForm.getDefaultResources().put(
COSName.getPDFName("ArialMT"),
PDType0Font.load(doc, new FileInputStream("c:/windows/fonts/arial.ttf"), false));
field.setValue("Tilman Hausherr");
doc.save("F2_Datenblatt_022015-mod.pdf");
}
Update:
It turns out the code in the question would almost have worked with this file too. The operator is "Tf", not "tf", so the string should have been "/Helv 0 Tf 0 g". We'll research how to avoid the NPE and throw a meaningful exception instead.
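Since operator names are case-sensitive, a quick check before writing a DA string can catch this kind of typo. The following is a rough sketch with a hypothetical helper, not part of PDFBox; the regular expression only looks for a name, a number and the Tf operator, and is not a full content stream parser.

```java
import java.util.regex.Pattern;

final class DefaultAppearanceCheck {
    // "/FontName size Tf": a name, a number, then the case-sensitive Tf operator.
    private static final Pattern FONT_SETTING =
            Pattern.compile("/\\S+\\s+[-+]?\\d*\\.?\\d+\\s+Tf(\\s|$)");

    /** Returns true if the DA string appears to select a font with Tf. */
    static boolean setsFont(String da) {
        return da != null && FONT_SETTING.matcher(da).find();
    }

    public static void main(String[] args) {
        System.out.println(setsFont("/Helv 0 tf 0 g")); // false
        System.out.println(setsFont("/Helv 0 Tf 0 g")); // true
    }
}
```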
